There are at least two kinds of similarity. Relational similarity is correspondence between
relations, in contrast with attributional similarity, which is correspondence between attributes.
When two words have a high degree of attributional similarity, we call them synonyms. When
two pairs of words have a high degree of relational similarity, we say that their relations are
analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This
paper introduces Latent Relational Analysis (LRA), a method for measuring relational similar- 
ity. LRA has potential applications in many areas, including information extraction, word sense 
disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of informa- 
tion retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a 
collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the 
relation between a pair of words is characterized by a vector of frequencies of predefined patterns 
in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived 
automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth 
the frequency data, and (3) automatically generated synonyms are used to explore variations of 
the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the 
average human score of 57%. On the related problem of classifying semantic relations, LRA 
achieves similar gains over the VSM. 

1. Introduction 

There are at least two kinds of similarity. Attributional similarity is correspondence
between attributes and relational similarity is correspondence between relations (Medin, Goldstone, and Gentner, 1990).
When two words have a high degree of attributional similarity, we call them synonyms. 
When two word pairs have a high degree of relational similarity, we say they are analo- 
gous. 

Verbal analogies are often written in the form A:B::C:D, meaning A is to B as C is 
to D; for example, traffic:street::water:riverbed. Traffic flows over a street; water flows 
over a riverbed. A street carries traffic; a riverbed carries water. There is a high
degree of relational similarity between the word pair traffic:street and the word pair
water:riverbed. In fact, this analogy is the basis of several mathematical theories of traffic
flow (Daganzo, 1994).

In Section 2 we look more closely at the connections between attributional and
relational similarity. In analogies such as mason:stone::carpenter:wood, it seems that
relational similarity can be reduced to attributional similarity, since mason and carpenter
are attributionally similar, as are stone and wood. In general, this reduction fails. Con- 
sider the analogy traffic:street::water:riverbed. Traffic and water are not attributionally 
similar. Street and riverbed are only moderately attributionally similar.

Many algorithms have been proposed for measuring the attributional similarity
between two words (Lesk, 1969; Resnik, 1995; Landauer and Dumais, 1997; Jiang and Conrath, 1997;
Lin, 1998b; Turney, 2001; Budanitsky and Hirst, 2001; Banerjee and Pedersen, 2003).
Measures of attributional similarity have been studied extensively, due to their applications
in problems such as recognizing synonyms (Landauer and Dumais, 1997), information
retrieval (Deerwester et al., 1990), determining semantic orientation (Turney, 2002),
grading student essays (Rehder et al., 1998), measuring textual cohesion (Morris and Hirst, 1991),
and word sense disambiguation (Lesk, 1986).

On the other hand, since measures of relational similarity are not as well developed 
as measures of attributional similarity, the potential applications of relational similarity 
are not as well known. Many problems that involve semantic relations would benefit 
from an algorithm for measuring relational similarity. We discuss related problems in 
natural language processing, information retrieval, and information extraction in more 
detail in Section 3.

This paper builds on the Vector Space Model (VSM) of information retrieval. Given 
a query, a search engine produces a ranked list of documents. The documents are 
ranked in order of decreasing attributional similarity between the query and each doc- 
ument. Almost all modern search engines measure attributional similarity using the 
VSM (Baeza-Yates and Ribeiro-Neto, 1999). Turney and Littman (2005) adapted the VSM
approach to measuring relational similarity. They used a vector of frequencies of
patterns in a corpus to represent the relation between a pair of words. Section 4 presents
the VSM approach to measuring similarity. 
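
As a rough illustration of representing a relation by a vector of pattern frequencies, the following minimal sketch compares two word pairs by the cosine of the angle between their vectors. The choice of cosine is an assumption here (it is the usual similarity measure in the VSM of information retrieval), and the pattern counts are hypothetical values invented for the example, not corpus data.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An all-zero vector carries no evidence about the relation.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical counts for three illustrative patterns, e.g.
# "X cuts Y", "X works with Y", "X of Y". These numbers are made up.
mason_stone = [12, 30, 2]
carpenter_wood = [10, 28, 1]
book_word = [0, 1, 40]

print(cosine(mason_stone, carpenter_wood) > cosine(mason_stone, book_word))
```

Under these made-up counts, mason:stone is closer to carpenter:wood than to book:word, which is the behaviour a relational similarity measure should show on the SAT example.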

In Section 5 we present an algorithm for measuring relational similarity, which
we call Latent Relational Analysis (LRA). The algorithm learns from a large corpus of 
unlabeled, unstructured text, without supervision. LRA extends the VSM approach 
of Turney and Littman (2005) in three ways: (1) The connecting patterns are derived
automatically from the corpus, instead of using a fixed set of patterns. (2) Singular
Value Decomposition (SVD) is used to smooth the frequency data. (3) Given a word 
pair such as traffic:street, LRA considers transformations of the word pair, generated by 
replacing one of the words by synonyms, such as traffic:road, traffic:highway. 
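
The third extension, replacing one word of a pair by a synonym, can be sketched as follows. This is only an illustration of the idea: the function name `alternate_pairs` and the synonym table are hypothetical, and the sketch says nothing about how LRA obtains or filters its synonyms.

```python
def alternate_pairs(a, b, synonyms):
    """Return variants of the pair a:b in which exactly one word
    has been replaced by one of its listed synonyms."""
    alternates = [(a_alt, b) for a_alt in synonyms.get(a, [])]
    alternates += [(a, b_alt) for b_alt in synonyms.get(b, [])]
    return alternates


# Hypothetical synonym table, hand-written for this example.
synonyms = {"street": ["road", "highway"]}

print(alternate_pairs("traffic", "street", synonyms))
# [('traffic', 'road'), ('traffic', 'highway')]
```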

Section 6 presents our experimental evaluation of LRA with a collection of 374
multiple-choice word analogy questions from the SAT college entrance exam.^ An
example of a typical SAT question appears in Table 1. In the educational testing literature,
the first pair (mason:stone) is called the stem of the analogy. The correct choice is called 
the solution and the incorrect choices are distractors. We evaluate LRA by testing its abil- 
ity to select the solution and avoid the distractors. The average performance of college- 
bound senior high school students on verbal SAT questions corresponds to an accuracy 
of about 57%. LRA achieves an accuracy of about 56%. On these same questions, the 
VSM attained 47%. 

One application for relational similarity is classifying semantic relations in
noun-modifier pairs (Turney and Littman, 2005). In Section 7 we evaluate the performance
of LRA with a set of 600 noun-modifier pairs from Nastase and Szpakowicz (2003). The
problem is to classify a noun-modifier pair, such as "laser printer", according to the 
semantic relation between the head noun (printer) and the modifier (laser). The 600 
pairs have been manually labeled with 30 classes of semantic relations. For example,
"laser printer" is classified as instrument; the printer uses the laser as an instrument for
printing.

^The College Board eliminated analogies from the SAT in 2005, apparently because it was believed that
analogy questions discriminate against minorities, although it has been argued by liberals (Goldenberg, 2005)
that dropping analogy questions has increased discrimination against minorities and by conservatives
(Kurtz, 2002) that it has decreased academic standards. Analogy questions remain an important component
in many other tests, such as the GRE.

Table 1
An example of a typical SAT question, from the collection of 374 questions.

Stem:      mason:stone
Choices:   (a) teacher:chalk
           (b) carpenter:wood
           (c) soldier:gun
           (d) photograph:camera
           (e) book:word
Solution:  (b) carpenter:wood


We approach the task of classifying semantic relations in noun-modifier pairs as a 
supervised learning problem. The 600 pairs are divided into training and testing sets 
and a testing pair is classified according to the label of its single nearest neighbour in the 
training set. LRA is used to measure distance (i.e., similarity, nearness). LRA achieves 
an accuracy of 39.8% on the 30-class problem and 58.0% on the 5-class problem. On the 
same 600 noun-modifier pairs, the VSM had accuracies of 27.8% (30-class) and 45.7% 
(5-class) (Turney and Littman, 2005).
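
The single nearest neighbour scheme can be sketched in a few lines. The similarity function is passed in as a parameter, so any measure of relational similarity could be plugged in. Everything below the function definition is hypothetical: the pattern sets, the toy overlap measure, and the label "cause" are invented for the example and are not taken from the 600-pair data set.

```python
def classify_nearest(test_pair, training, sim_r):
    """Label test_pair with the class of its most similar training pair.

    training is a list of (pair, label) tuples and sim_r is any function
    that maps two pairs to a real-valued similarity.
    """
    _, best_label = max(training, key=lambda item: sim_r(test_pair, item[0]))
    return best_label


# Hypothetical sets of connecting patterns for each pair (illustration only).
toy_patterns = {
    ("laser", "printer"): {"uses", "with"},
    ("flu", "virus"): {"causes"},
    ("gas", "stove"): {"uses", "with", "powered by"},
}


def toy_sim(p, q):
    """Toy stand-in for relational similarity: overlap of pattern sets."""
    a, b = toy_patterns[p], toy_patterns[q]
    return len(a & b) / len(a | b)


training = [(("laser", "printer"), "instrument"), (("flu", "virus"), "cause")]
print(classify_nearest(("gas", "stove"), training, toy_sim))  # instrument
```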

We discuss the experimental results, limitations of LRA, and future work in
Section 8 and we conclude in Section 9.

2. Attributional and Relational Similarity 

In this section, we explore connections between attributional and relational similarity.

2.1 Types of Similarity

Medin, Goldstone, and Gentner (1990) distinguish attributes and relations as follows:

Attributes are predicates taking one argument (e.g., X is red, X is large), whereas 
relations are predicates taking two or more arguments (e.g., X collides with Y, X 
is larger than Y). Attributes are used to state properties of objects; relations express 
relations between objects or propositions. 

Gentner (1983) notes that what counts as an attribute or a relation can depend on the
context. For example, large can be viewed as an attribute of X, LARGE(X), or a relation 
between X and some standard Y, LARGER_THAN(X, Y). 

The amount of attributional similarity between two words, A and B, depends on the 
degree of correspondence between the properties of A and B. A measure of attributional 
similarity is a function that maps two words, A and B, to a real number, $\mathrm{sim}_a(A, B) \in \mathbb{R}$.
The more correspondence there is between the properties of A and B, the greater
their attributional similarity. For example, dog and wolf have a relatively high degree of
attributional similarity. 

The amount of relational similarity between two pairs of words, A:B and C:D, de- 
pends on the degree of correspondence between the relations between A and B and 
the relations between C and D. A measure of relational similarity is a function that 
maps two pairs, A:B and C:D, to a real number, $\mathrm{sim}_r(A{:}B, C{:}D) \in \mathbb{R}$. The more
correspondence there is between the relations of A:B and C:D, the greater their relational
similarity. For example, dog:bark and cat:meow have a relatively high degree of relational
similarity.

Cognitive scientists distinguish words that are semantically associated (bee-honey) 
from words that are semantically similar (deer-pony), although they recognize that some 
words are both associated and similar (doctor-nurse) (Chiarello et al., 1990). Both of
these are types of attributional similarity, since they are based on correspondence be- 
tween attributes (e.g., bees and honey are both found in hives; deer and ponies are both 
mammals). 

Budanitsky and Hirst (2001) describe semantic relatedness as follows:

Recent research on the topic in computational linguistics has emphasized the 
perspective of semantic relatedness of two lexemes in a lexical resource, or its 
inverse, semantic distance. It's important to note that semantic relatedness is a more 
general concept than similarity; similar entities are usually assumed to be related 
by virtue of their likeness (bank-trust company), but dissimilar entities may also be
semantically related by lexical relationships such as meronymy (car-wheel) and
antonymy (hot-cold), or just by any kind of functional relationship or frequent
association (pencil-paper, penguin-Antarctica).

As these examples show, semantic relatedness is the same as attributional similarity 
(e.g., hot and cold are both kinds of temperature, pencil and paper are both used for 
writing). Here we prefer to use the term attributional similarity, because it emphasizes 
the contrast with relational similarity. The term semantic relatedness may lead to confu- 
sion when the term relational similarity is also under discussion. 

Resnik (1995) describes semantic similarity as follows:

Semantic similarity represents a special case of semantic relatedness: for example, 
cars and gasoline would seem to be more closely related than, say, cars and
bicycles, but the latter pair are certainly more similar. Rada et al. (1989) suggest
that the assessment of similarity in semantic networks can in fact be thought of as
involving just taxonomic (IS-A) links, to the exclusion of other link types; that view
will also be taken here, although admittedly it excludes some potentially useful 
information. 

Thus semantic similarity is a specific type of attributional similarity. The term semantic 
similarity is misleading, because it refers to a type of attributional similarity, yet rela- 
tional similarity is not any less semantic than attributional similarity. 

To avoid confusion, we will use the terms attributional similarity and relational
similarity, following Medin, Goldstone, and Gentner (1990). Instead of semantic similarity
(Resnik, 1995) or semantically similar (Chiarello et al., 1990), we prefer the term
taxonomical similarity, which we take to be a specific type of attributional similarity. We interpret
synonymy as a high degree of attributional similarity. Analogy is a high degree of rela- 
tional similarity. 

2.2 Measuring Attributional Similarity 

Algorithms for measuring attributional similarity can be lexicon-based (Lesk, 1986; Budanitsky and Hirst, 2001;
Banerjee and Pedersen, 2003), corpus-based (Lesk, 1969; Landauer and Dumais, 1997; Lin, 1998a;
Turney, 2001), or a hybrid of the two (Resnik, 1995; Jiang and Conrath, 1997; Turney et al., 2003).
Intuitively, we might expect that lexicon-based algorithms would be better at capturing
synonymy than corpus-based algorithms, since lexicons, such as WordNet, explicitly
provide synonymy information that is only implicit in a corpus. However, experiments
do not support this intuition.

Several algorithms have been evaluated using 80 multiple-choice synonym questions
taken from the Test of English as a Foreign Language (TOEFL). An example of
one of the 80 TOEFL questions appears in Table 2. Table 3 shows the best performance
on the TOEFL questions for each type of attributional similarity algorithm. The results
support the claim that lexicon-based algorithms have no advantage over corpus-based
algorithms for recognizing synonymy.

Table 2
An example of a typical TOEFL question, from the collection of 80 questions.

Stem:      levied
Choices:   (a) imposed
           (b) believed
           (c) requested
           (d) correlated
Solution:  (a) imposed

Table 3
Performance of attributional similarity measures on the 80 TOEFL questions. (The average
non-English US college applicant's performance is included in the bottom row, for comparison.)

Reference                        Description                     Percent Correct
Jarmasz and Szpakowicz (2003)    best lexicon-based algorithm    78.75
Terra and Clarke (2003)          best corpus-based algorithm     81.25
Turney et al. (2003)             best hybrid algorithm           97.50
Landauer and Dumais (1997)       average human score             64.50

2.3 Using Attributional Similarity to Solve Analogies 

We may distinguish near analogies (mason:stone::carpenter:wood) from far analogies
(traffic:street::water:riverbed) (Gentner, 1983; Medin, Goldstone, and Gentner, 1990). In
an analogy A:B::C:D, where there is a high degree of relational similarity between A:B 
and C:D, if there is also a high degree of attributional similarity between A and C, and 
between B and D, then A:B::C:D is a near analogy; otherwise, it is a far analogy. 

It seems possible that SAT analogy questions might consist largely of near analogies, 
in which case they can be solved using attributional similarity measures. We could score 
each candidate analogy by the average of the attributional similarity, $\mathrm{sim}_a$, between A
and C and between B and D:

$$\mathrm{score}(A{:}B{::}C{:}D) = \tfrac{1}{2}\left(\mathrm{sim}_a(A, C) + \mathrm{sim}_a(B, D)\right) \qquad (1)$$

This kind of approach was used in two of the thirteen modules in Turney et al. (2003)
(see Section 3.1).
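
A minimal sketch of Equation (1) applied to a multiple-choice question follows. The function names and the similarity table are hypothetical; the similarity values are invented for the example and do not come from any of the measures evaluated here. The question is skipped (returns None) when every choice receives the same score, which is the skipping rule used in the evaluation.

```python
def near_analogy_score(a, b, c, d, sim_a):
    """Equation (1): average attributional similarity of A with C and B with D."""
    return 0.5 * (sim_a(a, c) + sim_a(b, d))


def answer(stem, choices, sim_a):
    """Return the index of the best-scoring choice, or None to skip the
    question when all choices receive the same score."""
    a, b = stem
    scores = [near_analogy_score(a, b, c, d, sim_a) for c, d in choices]
    best = max(scores)
    if all(s == best for s in scores):
        return None
    return scores.index(best)


# Hypothetical attributional similarities (made-up values for illustration).
toy_table = {
    ("mason", "carpenter"): 0.8,
    ("stone", "wood"): 0.6,
    ("mason", "teacher"): 0.4,
    ("stone", "chalk"): 0.7,
}


def toy_sim_a(x, y):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return toy_table.get((x, y), toy_table.get((y, x), 0.0))


stem = ("mason", "stone")
choices = [("teacher", "chalk"), ("carpenter", "wood")]
print(answer(stem, choices, toy_sim_a))  # 1, i.e. carpenter:wood
```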

To evaluate this approach, we applied several measures of attributional similarity to 
our collection of 374 SAT questions. The performance of the algorithms was measured 
by precision, recall, and F, defined as follows: 

$$\mathrm{precision} = \frac{\text{number of correct guesses}}{\text{total number of guesses made}} \qquad (2)$$

$$\mathrm{recall} = \frac{\text{number of correct guesses}}{\text{maximum possible number correct}} \qquad (3)$$

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4)$$

Note that recall is the same as percent correct (for multiple-choice questions, with only
zero or one guesses allowed per question, but not in general).

Table 4
Performance of attributional similarity measures on the 374 SAT questions. Precision, recall, and
F are reported as percentages. (The bottom two rows are not attributional similarity measures.
They are included for comparison.)

Algorithm                      Type               Precision   Recall   F
Hirst and St-Onge (1998)       lexicon-based      34.9        32.1     33.4
Jiang and Conrath (1997)       hybrid             29.8        27.3     28.5
Leacock and Chodorow (1998)    lexicon-based      32.8        31.3     32.0
Lin (1998b)                    hybrid             31.2        27.3     29.1
Resnik (1995)                  hybrid             35.7        33.2     34.4
Turney (2001)                  corpus-based       35.0        35.0     35.0
Turney and Littman (2005)      relational (VSM)   47.7        47.1     47.4
random                         random             20.0        20.0     20.0

Table 4 shows the experimental results for our set of 374 analogy questions. For
example, using the algorithm of Hirst and St-Onge (1998), 120 questions were answered
correctly, 224 incorrectly, and 30 questions were skipped. When the algorithm assigned 
the same similarity to all of the choices for a given question, that question was skipped. 
The precision was 120/(120 + 224) and the recall was 120/(120 + 224 + 30). 
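
The arithmetic for this example can be checked with a short sketch. The function name is our own; the counts are the ones quoted above for Hirst and St-Onge (1998), and the printed percentages agree with the corresponding row of Table 4.

```python
def precision_recall_f(correct, incorrect, skipped):
    """Precision, recall, and F for multiple-choice questions in which
    each question receives either zero guesses or one guess."""
    guesses = correct + incorrect
    total = guesses + skipped
    precision = correct / guesses if guesses else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f


# 120 correct, 224 incorrect, 30 skipped, as reported in the text.
p, r, f = precision_recall_f(120, 224, 30)
print(f"{100 * p:.1f} {100 * r:.1f} {100 * f:.1f}")  # 34.9 32.1 33.4
```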

The first five algorithms in Table 4 are implemented in Pedersen's WordNet-Similarity
package.^ The sixth algorithm (Turney, 2001) used the Waterloo MultiText System, as
described in Terra and Clarke (2003).

The difference between the lowest performance (Jiang and Conrath, 1997) and random
guessing is statistically significant with 95% confidence, according to the Fisher
Exact Test (Agresti, 1990). However, the difference between the highest performance
(Turney, 2001) and the VSM approach (Turney and Littman, 2005) is also statistically
significant with 95% confidence. We conclude that there are enough near analogies in 
the 374 SAT questions for attributional similarity to perform better than random guess- 
ing, but not enough near analogies for attributional similarity to perform as well as 
relational similarity. 

^See http://www.d.umn.edu/~tpederse/similarity.html

3. Related Work

This section is a brief survey of the many problems that involve semantic relations and
could potentially make use of an algorithm for measuring relational similarity.

3.1 Recognizing Word Analogies

The problem of recognizing word analogies is, given a stem word pair and a finite list
of choice word pairs, select the choice that is most analogous to the stem. This problem
was first attempted by a system called Argus (Reitman, 1965), using a small hand-built
semantic network. Argus could only solve the limited set of analogy questions that its 
programmer had anticipated. Argus was based on a spreading activation model and 
did not explicitly attempt to measure relational similarity. 

Turney et al. (2003) combined 13 independent modules to answer SAT questions.
The final output of the system was based on a weighted combination of the outputs of
each individual module. The best of the 13 modules was the VSM, which is described
in detail in Turney and Littman (2005). The VSM was evaluated on a set of 374 SAT
questions, achieving a score of 47%. 

In contrast with the corpus-based approach of Turney and Littman (2005), Veale (2004)
applied a lexicon-based approach to the same 374 SAT questions, attaining a score of
43%. Veale evaluated the quality of a candidate analogy A:B::C:D by looking for paths
in WordNet, joining A to B and C to D. The quality measure was based on the similarity
between the A:B paths and the C:D paths.

Turney (2005) introduced Latent Relational Analysis (LRA), an enhanced version
of the VSM approach, which reached 56% on the 374 SAT questions. Here we go
beyond Turney (2005) by describing LRA in more detail, performing more extensive
experiments, and analyzing the algorithm and related work in more depth.



3.2 Structure Mapping Theory 

French (2002) cites Structure Mapping Theory (SMT) (Gentner, 1983) and its implementation
in the Structure Mapping Engine (SME) (Falkenhainer, Forbus, and Gentner, 1989)
as the most influential work on modeling of analogy-making. The goal of computational 
modeling of analogy-making is to understand how people form complex, structured 
analogies. SME takes representations of a source domain and a target domain, and pro- 
duces an analogical mapping between the source and target. The domains are given 
structured propositional representations, using predicate logic. These descriptions in- 
clude attributes, relations, and higher-order relations (expressing relations between re- 
lations). The analogical mapping connects source domain relations to target domain 
relations. 

For example, there is an analogy between the solar system and Rutherford's model 
of the atom (Falkenhainer, Forbus, and Gentner, 1989). The solar system is the source
domain and Rutherford's model of the atom is the target domain. The basic objects in 
the source model are the planets and the sun. The basic objects in the target model are 
the electrons and the nucleus. The planets and the sun have various attributes, such 
as mass(sun) and mass(planet), and various relations, such as revolve(planet, sun) and 
attracts(sun, planet). Likewise, the nucleus and the electrons have attributes, such as 
charge(electron) and charge(nucleus), and relations, such as revolve(electron, nucleus) 
and attracts(nucleus, electron). SME maps revolve(planet, sun) to revolve(electron, nu- 
cleus) and attracts(sun, planet) to attracts(nucleus, electron). 
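
The mapping in this example can be illustrated with a toy sketch that pairs relations whose predicate names are identical and records the object correspondences they imply. This is not SME: the function `map_identical_relations` and the tuple encoding of relations are our own simplifications, and the sketch ignores attributes, higher-order relations, and the scoring of competing mappings.

```python
# Relations encoded as (predicate, argument, argument) tuples.
source = [("revolve", "planet", "sun"), ("attracts", "sun", "planet")]
target = [("revolve", "electron", "nucleus"), ("attracts", "nucleus", "electron")]


def map_identical_relations(source, target):
    """Pair up relations with identical predicate names and collect the
    object correspondences implied by those pairings."""
    mapping = {}
    matches = []
    for s_pred, *s_args in source:
        for t_pred, *t_args in target:
            if s_pred != t_pred or len(s_args) != len(t_args):
                continue
            proposed = dict(zip(s_args, t_args))
            # Accept a match only if it agrees with earlier correspondences.
            if all(mapping.get(k, v) == v for k, v in proposed.items()):
                mapping.update(proposed)
                matches.append(((s_pred, *s_args), (t_pred, *t_args)))
    return matches, mapping


matches, mapping = map_identical_relations(source, target)
print(mapping)  # {'planet': 'electron', 'sun': 'nucleus'}
```

Replacing the test `s_pred != t_pred` with a graded measure of relational similarity is where a method such as LRA could, in principle, be plugged in.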

Each individual connection (e.g., from revolve(planet, sun) to revolve(electron, nu- 
cleus)) in an analogical mapping implies that the connected relations are similar; thus, 
SMT requires a measure of relational similarity, in order to form maps. Early versions 
of SME only mapped identical relations, but later versions of SME allowed similar, 
non-identical relations to match (Falkenhainer, 1990). However, the focus of research
in analogy-making has been on the mapping process as a whole, rather than measuring 
the similarity between any two particular relations, hence the similarity measures used 
in SME at the level of individual connections are somewhat rudimentary. 

We believe that a more sophisticated measure of relational similarity, such as LRA, 
may enhance the performance of SME. Likewise, the focus of our work here is on the 
similarity between particular relations, and we ignore systematic mapping between sets
of relations, so LRA may also be enhanced by integration with SME.

3.3 Metaphor

Metaphorical language is very common in our daily life; so common that we are usually
unaware of it (Lakoff and Johnson, 1980). Gentner et al. (2001) argue that novel metaphors
are understood using analogy, but conventional metaphors are simply recalled from
memory. A conventional metaphor is a metaphor that has become entrenched in our
language (Lakoff and Johnson, 1980). Dolan (1995) describes an algorithm that can
recognize conventional metaphors, but is not suited to novel metaphors. This suggests
that it may be fruitful to combine Dolan's (1995) algorithm for handling conventional
metaphorical language with LRA and SME for handling novel metaphors.

Lakoff and Johnson (1980) give many examples of sentences in support of their claim
that metaphorical language is ubiquitous. The metaphors in their sample sentences can
be expressed using SAT-style verbal analogies of the form A:B::C:D. The first column
in Table 5 is a list of sentences from Lakoff and Johnson (1980) and the second column
shows how the metaphor that is implicit in each sentence may be made explicit as a
verbal analogy.

Table 5
Metaphorical sentences from Lakoff and Johnson (1980), rendered as SAT-style verbal analogies.

Metaphorical sentence                  SAT-style verbal analogy
He shot down all of my arguments.      aircraft:shoot_down::argument:refute
I demolished his argument.             building:demolish::argument:refute
You need to budget your time.          money:budget::time:schedule
I've invested a lot of time in her.    money:invest::time:allocate
My mind just isn't operating today.    machine:operate::mind:think
Life has cheated me.                   charlatan:cheat::life:disappoint
Inflation is eating up our profits.    animal:eat::inflation:reduce

3.4 Classifying Semantic Relations 

The task of classifying semantic relations is to identify the relation between a pair of 
words. Often the pairs are restricted to noun-modifier pairs, but there are many inter- 
esting relations, such as antonymy, that do not occur in noun-modifier pairs. However, 
noun-modifier pairs are interesting due to their high frequency in English. For instance, 
WordNet 2.0 contains more than 26,000 noun-modifier pairs, although many common
noun-modifiers are not in WordNet, especially technical terms.

Rosario and Hearst (2001) and Rosario, Hearst, and Fillmore (2002) classify
noun-modifier relations in the medical domain, using MeSH (Medical Subject Headings) and
UMLS (Unified Medical Language System) as lexical resources for representing each
noun-modifier pair with a feature vector. They trained a neural network to distinguish
13 classes of semantic relations. Nastase and Szpakowicz (2003) explore a similar
approach to classifying general noun-modifier pairs (i.e., not restricted to a particular
domain, such as medicine), using WordNet and Roget's Thesaurus as lexical resources.
Vanderwende (1994) used hand-built rules, together with a lexical knowledge base, to
classify noun-modifier pairs.

None of these approaches explicitly involved measuring relational similarity, but 
any classification of semantic relations necessarily employs some implicit notion of 
relational similarity, since members of the same class must be relationally similar to
some extent. Barker and Szpakowicz (1998) tried a corpus-based approach that explicitly
used a measure of relational similarity, but their measure was based on literal matching,
which limited its ability to generalize. Moldovan et al. (2004) also used a measure
of relational similarity, based on mapping each noun and modifier into semantic classes
in WordNet. The noun-modifier pairs were taken from a corpus and the surrounding
context in the corpus was used in a word sense disambiguation algorithm, to improve
the mapping of the noun and modifier into WordNet. Turney and Littman (2005) used
the VSM (as a component in a single nearest neighbour learning algorithm) to measure
relational similarity. We take the same approach here, substituting LRA for the VSM, in
Section 7.

Lauer (1995) used a corpus-based approach (using the BNC) to paraphrase
noun-modifier pairs, by inserting the prepositions of, for, in, at, on, from, with, and about. For
example, reptile haven was paraphrased as haven for reptiles. Lapata and Keller (2004)
achieved improved results on this task, by using the database of AltaVista's search
engine as a corpus.

3.5 Word Sense Disambiguation 

We believe that the intended sense of a polysemous word is determined by its semantic 
relations with the other words in the surrounding text. If we can identify the semantic 
relations between the given word and its context, then we can disambiguate the given 
word. Yarowsky's (1993) observation that collocations are almost always monosemous
is evidence for this view. Federici, Montemagni, and Pirrelli (1997) present an
analogy-based approach to word sense disambiguation.

For example, consider the word plant. Out of context, plant could refer to an
industrial plant or a living organism. Suppose plant appears in some text near food. A typical
approach to disambiguating plant would compare the attributional similarity of food
and industrial plant to the attributional similarity of food and living organism (Lesk, 1986;
Banerjee and Pedersen, 2003). In this case, the decision may not be clear, since industrial
plants often produce food and living organisms often serve as food. It would be very
helpful to know the relation between food and plant in this example. In the phrase "food
for the plant", the relation between food and plant strongly suggests that the plant is a 
living organism, since industrial plants do not need food. In the text "food at the plant", 
the relation strongly suggests that the plant is an industrial plant, since living organ- 
isms are not usually considered as locations. Thus an algorithm for classifying semantic 
relations (as in Section 7) should be helpful for word sense disambiguation.

3.6 Information Extraction 

The problem of relation extraction is, given an input document and a specific relation R, 
extract all pairs of entities (if any) that have the relation R in the document. The prob- 
lem was introduced as part of the Message Understanding Conferences (MUC) in 1998. 
Zelenko, Aone, and Richardella (2003) present a kernel method for extracting the
relations person-affiliation and organization-location. For example, in the sentence "John Smith
is the chief scientist of the Hardcom Corporation," there is a person-affiliation relation
between "John Smith" and "Hardcom Corporation" (Zelenko, Aone, and Richardella, 2003).
This is similar to the problem of classifying semantic relations (Section 3.4), except that
information extraction focuses on the relation between a specific pair of entities in a 
specific document, rather than a general pair of words in general text. Therefore an 
algorithm for classifying semantic relations should be useful for information extraction. 

In the VSM approach to classifying semantic relations (Turney and Littman, 2005),
we would have a training set of labeled examples of the relation person-affiliation, for
instance. Each example would be represented by a vector of pattern frequencies. Given a
specific document discussing "John Smith" and "Hardcom Corporation", we could
construct a vector representing the relation between these two entities, and then measure
the relational similarity between this unlabeled vector and each of our labeled training 
vectors. It would seem that there is a problem here, because the training vectors would 
be relatively dense, since they would presumably be derived from a large corpus, but 
the new unlabeled vector for "John Smith" and "Hardcom Corporation" would be very 
sparse, since these entities might be mentioned only once in the given document. How- 
ever, this is not a new problem for the Vector Space Model; it is the standard situation 
when the VSM is used for information retrieval. A query to a search engine is rep- 
resented by a very sparse vector whereas a document is represented by a relatively 
dense vector. There are well-known techniques in information retrieval for coping with 
this disparity, such as weighting schemes for query vectors that are different from the 
weighting schemes for document vectors (Salton and Buckley, 1988).

3.7 Question Answering 

In their paper on classifying semantic relations, Moldovan et al. (2004) suggest that an
important application of their work is Question Answering. As defined in the Text 
REtrieval Conference (TREC) Question Answering (QA) track, the task is to answer 
simple questions, such as "Where have nuclear incidents occurred?", by retrieving a 
relevant document from a large corpus and then extracting a short string from the doc- 
ument, such as "The Three Mile Island nuclear incident caused a DOE policy crisis." 
Moldovan et al. (2004) propose to map a given question to a semantic relation and then
search for that relation in a corpus of semantically tagged text. They argue that the
desired semantic relation can easily be inferred from the surface form of the question. A
question of the form "Where ...?" is likely to be seeking entities with a location
relation and a question of the form "What did ... make?" is likely to be looking for entities
with a product relation. In Section 7 we show how LRA can recognize relations such as
location and product (see Table 19).

3.8 Automatic Thesaurus Generation 

Hearst (1992) presents an algorithm for learning hyponym (type of) relations from a
corpus and Berland and Charniak (1999) describe how to learn meronym (part of)
relations from a corpus. These algorithms could be used to automatically generate a
thesaurus or dictionary, but we would like to handle more relations than hyponymy
and meronymy. WordNet distinguishes more than a dozen semantic relations between
words (Fellbaum, 1998) and Nastase and Szpakowicz (2003) list 30 semantic relations
for noun-modifier pairs. Hearst (1992) and Berland and Charniak (1999) use manually
generated rules to mine text for semantic relations. Turney and Littman (2005) also use
a manually generated set of 64 patterns.

LRA does not use a predefined set of patterns; it learns patterns from a large corpus. 
Instead of manually generating new rules or patterns for each new semantic relation, 
it is possible to automatically learn a measure of relational similarity that can handle 
arbitrary semantic relations. A nearest neighbour algorithm can then use this relational 
similarity measure to learn to classify according to any set of classes of relations, given 
the appropriate labeled training data. 

Girju, Badulescu, and Moldovan (2003) present an algorithm for learning meronym
relations from a corpus. Like Hearst (1992) and Berland and Charniak (1999), they use
manually generated rules to mine text for their desired relation. However, they supple- 
ment their manual rules with automatically learned constraints, to increase the precision 
of the rules.

3.9 Information Retrieval

Veale (2004) has developed an algorithm for recognizing certain types of word
analogies, based on information in WordNet. He proposes to use the algorithm for
analogical information retrieval. For example, the query "Muslim church" should return
"mosque" and the query "Hindu bible" should return "the Vedas". The algorithm was
designed with a focus on analogies of the form adjective:noun::adjective:noun, such as
Christian:church::Muslim:mosque.

A measure of relational similarity is applicable to this task. Given a pair of words, 
A and B, the task is to return another pair of words, X and Y, such that there is high 
relational similarity between the pair A:X and the pair Y:B. For example, given A =
"Muslim" and B = "church", return X = "mosque" and Y = "Christian". (The pair
Muslim:mosque has a high relational similarity to the pair Christian:church.)

Marx et al. (2002) developed an unsupervised algorithm for discovering analogies
by clustering words from two different corpora. Each cluster of words in one corpus 
is coupled one-to-one with a cluster in the other corpus. For example, one experiment 
used a corpus of Buddhist documents and a corpus of Christian documents. A cluster of 
words such as {Hindu, Mahayana, Zen, ...} from the Buddhist corpus was coupled with 
a cluster of words such as {Catholic, Protestant, ...} from the Christian corpus. Thus the 
algorithm appears to have discovered an analogical mapping between Buddhist schools 
and traditions and Christian schools and traditions. This is interesting work, but it is 
not directly applicable to SAT analogies, because it discovers analogies between clusters 
of words, rather than individual words. 



3.10 Identifying Semantic Roles 

A semantic frame for an event such as judgement contains semantic roles such as judge, 
evaluee, and reason, whereas an event such as statement contains roles such as speaker,
addressee, and message (Gildea and Jurafsky, 2002). The task of identifying semantic roles
is to label the parts of a sentence according to their semantic roles. We believe that it 
may be helpful to view semantic frames and their semantic roles as sets of semantic 
relations; thus a measure of relational similarity should help us to identify semantic 
roles. Moldovan et al. (2004) argue that semantic roles are merely a special case of
semantic relations (Section 3.4), since semantic roles always involve verbs or predicates,
but semantic relations can involve words of any part of speech. 



4. The Vector Space Model 



This section examines past work on measuring attributional and relational similarity 
using the Vector Space Model (VSM). 

4.1 Measuring Attributional Similarity with the Vector Space Model 

The VSM was first developed for information retrieval (Salton and McGill, 1983; Salton and Buckley, 1988;
Salton, 1989) and it is at the core of most modern search engines (Baeza-Yates and Ribeiro-Neto, 1999).
In the VSM approach to information retrieval, queries and documents are represented 
by vectors. Elements in these vectors are based on the frequencies of words in the cor- 
responding queries and documents. The frequencies are usually transformed by var- 
ious formulas and weights, tailored to improve the effectiveness of the search engine 
(Salton, 1989). The attributional similarity between a query and a document is
measured by the cosine of the angle between their corresponding vectors. For a given query,
the search engine sorts the matching documents in order of decreasing cosine. 

The VSM approach has also been used to measure the attributional similarity of
words (Lesk, 1969; Ruge, 1992; Pantel and Lin, 2002). Pantel and Lin (2002) clustered
words according to their attributional similarity, as measured by a VSM. Their
algorithm is able to discover the different senses of polysemous words, using unsupervised
learning. 

Latent Semantic Analysis enhances the VSM approach to information retrieval by
using the Singular Value Decomposition (SVD) to smooth the vectors, which helps to
handle noise and sparseness in the data (Deerwester et al., 1990; Dumais, 1993; Landauer and Dumais, 1997).
SVD improves both document-query attributional similarity measures (Deerwester et al., 1990;
Dumais, 1993) and word-word attributional similarity measures (Landauer and Dumais, 1997).
LRA also uses SVD to smooth vectors, as we discuss in Section 5.

4.2 Measuring Relational Similarity with the Vector Space Model 

Let $R_1$ be the semantic relation (or set of relations) between a pair of words, A and B,
and let $R_2$ be the semantic relation (or set of relations) between another pair, C and
D. We wish to measure the relational similarity between $R_1$ and $R_2$. The relations $R_1$
and $R_2$ are not given to us; our task is to infer these hidden (latent) relations and then
compare them.

In the VSM approach to relational similarity (Turney and Littman, 2005), we create
vectors, $r_1$ and $r_2$, that represent features of $R_1$ and $R_2$, and then measure the similarity
of $R_1$ and $R_2$ by the cosine of the angle $\theta$ between $r_1$ and $r_2$:

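$$r_1 = (r_{1,1}, \ldots, r_{1,n}) \qquad (5)$$

$$r_2 = (r_{2,1}, \ldots, r_{2,n}) \qquad (6)$$

$$\mathrm{cosine}(\theta) = \frac{\sum_{i=1}^{n} r_{1,i} \cdot r_{2,i}}{\sqrt{\sum_{i=1}^{n} (r_{1,i})^2 \cdot \sum_{i=1}^{n} (r_{2,i})^2}} = \frac{r_1 \cdot r_2}{\sqrt{r_1 \cdot r_1} \cdot \sqrt{r_2 \cdot r_2}} = \frac{r_1 \cdot r_2}{\|r_1\| \cdot \|r_2\|} \qquad (7)$$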

We create a vector, r, to characterize the relationship between two words, X and Y, 
by counting the frequencies of various short phrases containing X and Y. Turney and Littman (2005)
use a list of 64 joining terms, such as "of", "for", and "to", to form 128 phrases that con- 
tain X and Y, such as "X of Y", "Y of X" , "X for Y" , "Y for X" , "X to Y", and "Y to 
X". These phrases are then used as queries for a search engine and the number of hits 
(matching documents) is recorded for each query. This process yields a vector of 128 
numbers. If the number of hits for a query is x, then the corresponding element in the 
vector r is log(x + 1). Several authors report that the logarithmic transformation of
frequencies improves cosine-based similarity measures (Salton and Buckley, 1988; Ruge, 1992;
Lin, 1998b).
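
As a rough illustration of this construction, the sketch below forms the two phrases per joining term, applies the log(x + 1) transformation, and computes the cosine of equation (7). The three joining terms, the `toy_hit_count` function, and the hit counts are hypothetical stand-ins for the 64-term list and the search engine queries; they are not taken from Turney and Littman (2005).

```python
import math

def pair_phrases(x, y, joining_terms):
    """Return the phrases "X term Y" and "Y term X" for every joining term."""
    phrases = []
    for term in joining_terms:
        phrases.append(f"{x} {term} {y}")
        phrases.append(f"{y} {term} {x}")
    return phrases

def vsm_vector(x, y, joining_terms, hit_count):
    """Build the vector r: one element log(hits + 1) per phrase."""
    return [math.log(hit_count(p) + 1) for p in pair_phrases(x, y, joining_terms)]

def cosine(r1, r2):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical joining terms and hit counts, for illustration only.
TERMS = ["of", "for", "to"]
TOY_HITS = {"quart of volume": 5, "volume to quart": 2}

def toy_hit_count(phrase):
    """Stand-in for a search engine query; unknown phrases get zero hits."""
    return TOY_HITS.get(phrase, 0)

r = vsm_vector("quart", "volume", TERMS, toy_hit_count)
```

With 64 joining terms, `pair_phrases` yields the 128 phrases described above; with the three toy terms it yields six.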

Turney and Littman (2005) evaluated the VSM approach by its performance on 374
SAT analogy questions, achieving a score of 47%. Since there are five choices for each 
question, the expected score for random guessing is 20%. To answer a multiple-choice 
analogy question, vectors are created for the stem pair and each choice pair, and then 
cosines are calculated for the angles between the stem pair and each choice pair. The 
best guess is the choice pair with the highest cosine. We use the same set of analogy 
questions to evaluate LRA in Section 6.
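
The answering procedure can be sketched as follows, assuming the stem and choice vectors have already been built. The function name `answer_question` and the three-element vectors are made up for illustration; real VSM vectors have 128 elements.

```python
import math

def cosine(r1, r2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norms = math.sqrt(sum(a * a for a in r1)) * math.sqrt(sum(b * b for b in r2))
    return dot / norms if norms else 0.0

def answer_question(stem_vector, choice_vectors):
    """Return (index of the best choice, list of cosines) for one question."""
    cosines = [cosine(stem_vector, choice) for choice in choice_vectors]
    best = max(range(len(cosines)), key=lambda i: cosines[i])
    return best, cosines

# Made-up vectors for one stem pair and five choice pairs.
stem = [1.0, 0.0, 2.0]
choices = [
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 4.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
best, cosines = answer_question(stem, choices)
```

Here the second choice is parallel to the stem vector, so it has the highest cosine and is returned as the best guess.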

The VSM was also evaluated by its performance as a distance (nearness) mea- 
sure in a supervised nearest neighbour classifier for noun-modifier semantic relations 
(Turney and Littman, 2005). The evaluation used 600 hand-labeled noun-modifier pairs
from Nastase and Szpakowicz (2003). A testing pair is classified by searching for its
single nearest neighbour in the labeled training data. The best guess is the label for
the training pair with the highest cosine. LRA is evaluated with the same set of
noun-modifier pairs in Section 7.
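
A single nearest neighbour classifier of this kind might look like the sketch below. The training vectors and the labels `relation_1` to `relation_3` are placeholders, not classes from Nastase and Szpakowicz (2003).

```python
import math

def cosine(r1, r2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norms = math.sqrt(sum(a * a for a in r1)) * math.sqrt(sum(b * b for b in r2))
    return dot / norms if norms else 0.0

def classify(test_vector, training_set):
    """Return the label of the training vector with the highest cosine."""
    best_label, best_cosine = None, -math.inf
    for vector, label in training_set:
        c = cosine(test_vector, vector)
        if c > best_cosine:
            best_label, best_cosine = label, c
    return best_label

# Placeholder training data: (vector, label) for each labeled pair.
training = [
    ([5.0, 0.0, 1.0], "relation_1"),
    ([0.0, 4.0, 1.0], "relation_2"),
    ([1.0, 1.0, 6.0], "relation_3"),
]
predicted = classify([4.0, 1.0, 0.0], training)
```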

Turney and Littman (2005) used the AltaVista search engine to obtain the frequency
information required to build vectors for the VSM. Thus their corpus was the set of all
web pages indexed by AltaVista. At the time, the English subset of this corpus consisted
of about $5 \times 10^{11}$ words. Around April 2004, AltaVista made substantial changes to
their search engine, removing their advanced search operators. Their search engine no
longer supports the asterisk operator, which was used by Turney and Littman (2005)
for stemming and wild-card searching. AltaVista also changed their policy towards
automated searching, which is now forbidden.^

Turney and Littman (2005) used AltaVista's hit count, which is the number of
documents (web pages) matching a given query, but LRA uses the number of passages (strings)
matching a query. In our experiments with LRA (Sections 6 and 7), we use a local copy of
the Waterloo MultiText System (Clarke, Cormack, and Palmer, 1998; Terra and Clarke, 2003),
running on a 16 CPU Beowulf Cluster, with a corpus of about $5 \times 10^{10}$ English words.
The Waterloo MultiText System (WMTS) is a distributed (multiprocessor) search engine, 
designed primarily for passage retrieval (although document retrieval is possible, as a 
special case of passage retrieval). The text and index require approximately one ter- 
abyte of disk space. Although AltaVista only gives a rough estimate of the number of 
matching documents, the Waterloo MultiText System gives exact counts of the number 
of matching passages. 

Turney et al. (2003) combine 13 independent modules to answer SAT questions. The
performance of LRA significantly surpasses this combined system, but there is no real
contest between these approaches, because we can simply add LRA to the combination,
as a fourteenth module. Since the VSM module had the best performance of the thirteen
modules (Turney et al., 2003), the following experiments focus on comparing VSM and
LRA. 

5. Latent Relational Analysis 

LRA takes as input a set of word pairs and produces as output a measure of the
relational similarity between any two of the input pairs. LRA relies on three resources: a
search engine with a very large corpus of text, a broad-coverage thesaurus of synonyms, 
and an efficient implementation of SVD. 

We first present a short description of the core algorithm. Later, in the following 
subsections, we will give a detailed description of the algorithm, as it is applied in the 
experiments in Sections 6 and 7.

• Given a set of word pairs as input, look in a thesaurus for synonyms for each 
word in each word pair. For each input pair, make alternate pairs by replacing 
the original words with their synonyms. The alternate pairs are intended to 
form near analogies with the corresponding original pairs (see Section lT^ . 

• Filter out alternate pairs that do not form near analogies, by dropping alternate 
pairs that co-occur rarely in the corpus. In the preceding step, if a synonym 
replaced an ambiguous original word, but the synonym captures the wrong 
sense of the original word, it is likely that there is no significant relation between 
the words in the alternate pair, so they will rarely co-occur. 

^See http://www.altavista.com/robots.txt for AltaVista's current policy towards "robots" (software for
automatically gathering web pages or issuing batches of queries). The protocol of the "robots.txt" file is
explained in http://www.robotstxt.org/wc/robots.html.

• For each original and alternate pair, search in the corpus for short phrases that
begin with one member of the pair and end with the other. These phrases char- 
acterize the relation between the words in each pair. 

• For each phrase from the previous step, create several patterns, by replacing 
words in the phrase with wild cards. 

• Build a pair-pattern frequency matrix, in which each cell represents the number 
of times that the corresponding pair (row) appears in the corpus with the cor- 
responding pattern (column). The number will usually be zero, resulting in a 
sparse matrix. 

• Apply the Singular Value Decomposition to the matrix. This reduces noise in 
the matrix and helps with sparse data. 

• Suppose that we wish to calculate the relational similarity between any two of 
the original pairs. Start by looking for the two row vectors in the pair-pattern 
frequency matrix that correspond to the two original pairs. Calculate the co- 
sine of the angle between these two row vectors. Then merge the cosine of the 
two original pairs with the cosines of their corresponding alternate pairs, as fol- 
lows. If an analogy formed with alternate pairs has a higher cosine than the 
original pairs, we assume that we have found a better way to express the anal- 
ogy, but we have not significantly changed its meaning. If the cosine is lower, 
we assume that we may have changed the meaning, by inappropriately replac- 
ing words with synonyms. Filter out inappropriate alternates by dropping all 
analogies formed of alternates, such that the cosines are less than the cosine for 
the original pairs. The relational similarity between the two original pairs is 
then calculated as the average of all of the remaining cosines. 

The motivation for the alternate pairs is to handle cases where the original pairs co- 
occur rarely in the corpus. The hope is that we can find near analogies for the original 
pairs, such that the near analogies co-occur more frequently in the corpus. The danger 
is that the alternates may have different relations from the originals. The filtering steps 
above aim to reduce this risk. 
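
The merging rule in the last step of the list above can be sketched as follows. The cosines are the sixteen values reported in Table 10 for quart:volume and mile:distance; the function name and the dictionary layout are illustrative assumptions, not part of the original system.

```python
def relational_similarity(cosines, original_key):
    """Average the cosines that are at least as large as the original pairs' cosine."""
    original = cosines[original_key]
    kept = [c for c in cosines.values() if c >= original]
    return sum(kept) / len(kept)

# Cosines from Table 10, keyed by (version of A:B, version of C:D).
TABLE_10 = {
    ("quart:volume", "mile:distance"): 0.525,
    ("quart:volume", "feet:distance"): 0.464,
    ("quart:volume", "mile:length"): 0.634,
    ("quart:volume", "length:distance"): 0.499,
    ("liter:volume", "mile:distance"): 0.736,
    ("liter:volume", "feet:distance"): 0.687,
    ("liter:volume", "mile:length"): 0.745,
    ("liter:volume", "length:distance"): 0.576,
    ("gallon:volume", "mile:distance"): 0.763,
    ("gallon:volume", "feet:distance"): 0.710,
    ("gallon:volume", "mile:length"): 0.781,
    ("gallon:volume", "length:distance"): 0.615,
    ("pumping:volume", "mile:distance"): 0.412,
    ("pumping:volume", "feet:distance"): 0.439,
    ("pumping:volume", "mile:length"): 0.446,
    ("pumping:volume", "length:distance"): 0.491,
}

similarity = relational_similarity(TABLE_10, ("quart:volume", "mile:distance"))
```

Ten of the sixteen cosines pass the filter (the original pairs always pass, since their cosine equals the threshold), and their average rounds to the 0.677 reported in step 12 of Section 5.5.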

5.1 Input and Output 

In our experiments, the input set contains from 600 to 2,244 word pairs. The output
similarity measure is based on cosines, so the degree of similarity can range from -1
(dissimilar; $\theta = 180^\circ$) to +1 (similar; $\theta = 0^\circ$). Before applying SVD, the vectors are
completely nonnegative, which implies that the cosine can only range from 0 to +1, but
SVD introduces negative values, so it is possible for the cosine to be negative, although
we have never observed this in our experiments.

5.2 Search Engine and Corpus 

In the following experiments, we use a local copy of the Waterloo MultiText System 
(Clarke, Cormack, and Palmer, 1998; Terra and Clarke, 2003).* The corpus consists of
about $5 \times 10^{10}$ English words, gathered by a web crawler, mainly from US academic
web sites. The web pages cover a very wide range of topics, styles, genres, quality, and 
writing skill. The WMTS is well suited to LRA, because the WMTS scales well to large 
corpora (one terabyte, in our case), it gives exact frequency counts (unlike most web 
search engines), it is designed for passage retrieval (rather than document retrieval), 
and it has a powerful query syntax. 

*See http://multitext.uwaterloo.ca/

5.3 Thesaurus

As a source of synonyms, we use Lin's (1998a) automatically generated thesaurus. This
thesaurus is available through an online interactive demonstration or it can be down- 
loaded.^ We used the online demonstration, since the downloadable version seems to 
contain fewer words. For each word in the input set of word pairs, we automatically 
query the online demonstration and fetch the resulting list of synonyms. As a courtesy 
to other users of Lin's online system, we insert a 20 second delay between each query. 

Lin's thesaurus was generated by parsing a corpus of about $5 \times 10^7$ English words,
consisting of text from the Wall Street Journal, San Jose Mercury, and AP Newswire 
(Lin, 1998a). The parser was used to extract pairs of words and their grammatical
relations. Words were then clustered into synonym sets, based on the similarity of their
grammatical relations. Two words were judged to be highly similar when they tended
to have the same kinds of grammatical relations with the same sets of words. Given
a word and its part of speech, Lin's thesaurus provides a list of words, sorted in
order of decreasing attributional similarity. This sorting is convenient for LRA, since it
makes it possible to focus on words with higher attributional similarity and ignore 
the rest. WordNet, in contrast, given a word and its part of speech, provides a list of 
words grouped by the possible senses of the given word, with groups sorted by the 
frequencies of the senses. WordNet's sorting does not directly correspond to sorting by 
degree of attributional similarity, although various algorithms have been proposed for 
deriving attributional similarity from WordNet (Resnik, 1995; Jiang and Conrath, 1997;
Budanitsky and Hirst, 2001; Banerjee and Pedersen, 2003).

5.4 Singular Value Decomposition 

We use Rohde's SVDLIBC implementation of the Singular Value Decomposition, which 
is based on SVDPACKC (Berry, 1992).^ In LRA, SVD is used to reduce noise and
compensate for sparseness.

5.5 The Algorithm 

We will go through each step of LRA, using an example to illustrate the steps.
Assume that the input to LRA is the 374 multiple-choice SAT word analogy questions
of Turney and Littman (2005). Since there are six word pairs per question (the stem and
five choices), the input consists of 2,244 word pairs. Let's suppose that we wish to
calculate the relational similarity between the pair quart:volume and the pair mile:distance,
taken from the SAT question in Table 6. The LRA algorithm consists of the following
twelve steps: 

^The online demonstration is at http://www.cs.ualberta.ca/~lindek/demos/depsim.htm and the
downloadable version is at http://armena.cs.ualberta.ca/lindek/downloads/sims.lsp.gz

^SVDLIBC is available at http://tedlab.mit.edu/~dr/SVDLIBC/ and SVDPACKC is available at
http://www.netlib.org/svdpack/

Table 6

This SAT question, from Claman (2000), is used to illustrate the steps in the LRA algorithm.

Stem:      quart:volume
Choices:   (a) day:night
           (b) mile:distance
           (c) decade:century
           (d) friction:heat
           (e) part:whole
Solution:  (b) mile:distance

1. Find alternates: For each word pair A:B in the input set, look in Lin's (1998a)
thesaurus for the top num_sim words (in the following experiments, num_sim
is 10) that are most similar to A. For each A' that is similar to A, make a new
word pair A':B. Likewise, look for the top num_sim words that are most similar
to B, and for each B', make a new word pair A:B'. A:B is called the original pair
and each A':B or A:B' is an alternate pair. The intent is that alternates should
have almost the same semantic relations as the original. For each input pair,
there will now be 2 × num_sim alternate pairs. When looking for similar words
in Lin's (1998a) thesaurus, avoid words that seem unusual (e.g., hyphenated
words, words with three characters or less, words with non-alphabetical
characters, multi-word phrases, and capitalized words). The first column in Table 7
shows the alternate pairs that are generated for the original pair quart:volume.

2. Filter alternates: For each original pair A:B, filter the 2 × num_sim alternates
as follows. For each alternate pair, send a query to the WMTS, to find the
frequency of phrases that begin with one member of the pair and end with
the other. The phrases cannot have more than max_phrase words (we use
max_phrase = 5). Sort the alternate pairs by the frequency of their phrases.
Select the top num_filter most frequent alternates and discard the remainder
(we use num_filter = 3, so 17 alternates are dropped). This step tends to
eliminate alternates that have no clear semantic relation. The third column in
Table 7 shows the frequency with which each pair co-occurs in a window of
max_phrase words. The last column in Table 7 shows the pairs that are selected.
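
Steps 1 and 2 might be sketched as below. The thesaurus lookup and the frequency lookup are passed in as functions, since the real system queries Lin's thesaurus and the WMTS; the toy thesaurus reuses neighbours and frequencies from the legible cells of Table 7, while "half-pint", "cup", and "Imperial" are made-up entries added to exercise the filter for unusual words.

```python
def is_usable(word):
    """Reject the kinds of words that step 1 treats as unusual."""
    return (
        len(word) > 3              # skip words with three characters or fewer
        and word.isalpha()         # skip hyphens, spaces, digits, punctuation
        and word == word.lower()   # skip capitalized words
    )

def find_alternates(a, b, similar_words, num_sim=10):
    """Step 1: return up to 2 * num_sim alternate pairs A':B and A:B'."""
    def top(word):
        return [w for w in similar_words(word) if is_usable(w)][:num_sim]
    return [(a2, b) for a2 in top(a)] + [(a, b2) for b2 in top(b)]

def filter_alternates(alternates, phrase_frequency, num_filter=3):
    """Step 2: keep the num_filter alternates with the most frequent phrases."""
    ranked = sorted(alternates, key=phrase_frequency, reverse=True)
    return ranked[:num_filter]

# Toy thesaurus, sorted by decreasing similarity as in Lin's thesaurus.
THESAURUS = {
    "quart": ["gallon", "half-pint", "liter", "cup", "pail", "vial",
              "Imperial", "pumping", "ounce"],
    "volume": ["output", "value", "sale"],
}
# Phrase frequencies copied from Table 7.
FREQUENCY = {
    ("gallon", "volume"): 1500, ("liter", "volume"): 3323,
    ("pail", "volume"): 28, ("vial", "volume"): 373,
    ("pumping", "volume"): 1386, ("ounce", "volume"): 430,
    ("quart", "output"): 34, ("quart", "value"): 266, ("quart", "sale"): 119,
}

alternates = find_alternates("quart", "volume", lambda w: THESAURUS.get(w, []))
selected = filter_alternates(alternates, lambda pair: FREQUENCY.get(pair, 0))
```

On this toy input the three selected alternates are liter:volume, gallon:volume, and pumping:volume, the same three marked "accept (top alternate)" in Table 7.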

3. Find phrases: For each pair (originals and alternates), make a list of phrases
in the corpus that contain the pair. Query the WMTS for all phrases that begin
with one member of the pair and end with the other (in either order). We
ignore suffixes when searching for phrases that match a given pair. The phrases
cannot have more than max_phrase words and there must be at least one word
between the two members of the word pair. These phrases give us information
about the semantic relations between the words in each pair. A phrase with
no words between the two members of the word pair would give us very little
information about the semantic relations (other than that the words occur
together with a certain frequency in a certain order). Table 8 gives some examples
of phrases in the corpus that match the pair quart:volume.
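
The phrase search of step 3 can be imitated on a tokenized toy corpus as follows. This is only a sketch: the real system issues WMTS queries, and the prefix test used here to "ignore suffixes" is a crude assumption (it would also accept unrelated words such as "quartz").

```python
def find_phrases(tokens, a, b, max_phrase=5):
    """Return phrases that begin with one pair member and end with the other."""
    def matches(token, word):
        # Crude stand-in for ignoring suffixes: "quarts" matches "quart".
        return token.lower().startswith(word)

    phrases = []
    for start in range(len(tokens)):
        # Length 3 is the minimum: both pair members plus one word between them.
        for length in range(3, max_phrase + 1):
            end = start + length
            if end > len(tokens):
                break
            first, last = tokens[start], tokens[end - 1]
            if (matches(first, a) and matches(last, b)) or (
                matches(first, b) and matches(last, a)
            ):
                phrases.append(" ".join(tokens[start:end]))
    return phrases

corpus = "measure the volume in quarts and record it".split()
found = find_phrases(corpus, "quart", "volume")
```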

4. Find patterns: For each phrase found in the previous step, build patterns from
the intervening words. A pattern is constructed by replacing any or all or none
of the intervening words with wild cards (one wild card can only replace one
word). If a phrase is n words long, there are n - 2 intervening words between
the members of the given word pair (e.g., between quart and volume). Thus a
phrase with n words generates $2^{(n-2)}$ patterns. (We use max_phrase = 5, so a
phrase generates at most eight patterns.) For each pattern, count the number
of pairs (originals and alternates) with phrases that match the pattern (a wild
card must match exactly one word). Keep the top num_patterns most frequent
patterns and discard the rest (we use num_patterns = 4,000). Typically there
will be millions of patterns, so it is not feasible to keep them all.
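
A minimal sketch of step 4, assuming phrases are already grouped by word pair. The function names are illustrative, and "miles in distance" is a made-up phrase; the quart:volume phrases come from Table 8.

```python
from collections import Counter
from itertools import product

def patterns_for_phrase(phrase):
    """Return all 2^(n-2) patterns built from the intervening words."""
    middle = phrase.split()[1:-1]
    patterns = []
    # Each intervening word is either kept (False) or replaced by "*" (True).
    for mask in product([False, True], repeat=len(middle)):
        patterns.append(
            " ".join("*" if wild else word for word, wild in zip(middle, mask))
        )
    return patterns

def top_patterns(phrases_by_pair, num_patterns=4000):
    """Count, for each pattern, the number of pairs matching it; keep the top ones."""
    counts = Counter()
    for phrases in phrases_by_pair.values():
        pair_patterns = set()
        for phrase in phrases:
            pair_patterns.update(patterns_for_phrase(phrase))
        counts.update(pair_patterns)  # each pair counts at most once per pattern
    return [pattern for pattern, _ in counts.most_common(num_patterns)]

toy = {
    ("quart", "volume"): ["quarts in volume", "quart total volume"],
    ("mile", "distance"): ["miles in distance"],
}
kept = top_patterns(toy, num_patterns=2)
```

In this toy input, "in" and "*" are each matched by both pairs, while "total" is matched by only one, so the two patterns kept are "in" and "*".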

Table 7

Alternate forms of the original pair quart:volume. The first column shows the original pair and
the alternate pairs. The second column shows Lin's similarity score for the alternate word
compared to the original word. For example, the similarity between quart and pint is 0.210. The
third column shows the frequency of the pair in the WMTS corpus. The fourth column shows
the pairs that pass the filtering step (i.e., step 2).

Word pair            Similarity   Frequency     Filtering step
quart:volume         [missing]    [illegible]   accept (original pair)
pint:volume          0.210        372
gallon:volume        0.159        1500          accept (top alternate)
liter:volume         0.122        3323          accept (top alternate)
squirt:volume        0.084        [missing]
pail:volume          0.084        28
vial:volume          0.084        373
pumping:volume       0.073        1386          accept (top alternate)
ounce:volume         0.071        430
spoonful:volume      0.070        42
tablespoon:volume    0.069        96
quart:turnover       0.229        [missing]
quart:output         0.225        34
quart:export         0.206        7
quart:value          0.203        266
quart:import         0.186        16
quart:revenue        0.185        [missing]
quart:sale           0.169        119
quart:investment     0.161        11
quart:earnings       0.156        [missing]
quart:profit         0.156        24

Table 8

Some examples of phrases that contain quart:volume. Suffixes are ignored when searching for
matching phrases in the WMTS corpus. At least one word must occur between quart and
volume. At most max_phrase words can appear in a phrase.

"quarts liquid volume"      "volume in quarts"
"quarts of volume"          "volume capacity quarts"
"quarts in volume"          "volume being about two quarts"
"quart total volume"        "volume of milk in quarts"
"quart of spray volume"     "volume include measures like quart"

Table 9

Frequencies of various patterns for quart:volume. The asterisk "*" represents the wildcard.
Suffixes are ignored, so "quart" matches "quarts". For example, "quarts in volume" is one of the
four phrases that match "quart P volume" when P is "in".

                          P = "in"   P = "* of"   P = "of *"   P = "* *"
freq("quart P volume")    4          1            5            19
freq("volume P quart")    10         [missing]    2            16

5. Map pairs to rows: In preparation for building the matrix X, create a mapping
of word pairs to row numbers. For each pair A:B, create a row for A:B and
another row for B:A. This will make the matrix more symmetrical, reflecting
our knowledge that the relational similarity between A:B and C:D should be
the same as the relational similarity between B:A and D:C. This duplication of
rows is examined in Section 6.6.

6. Map patterns to columns: Create a mapping of the top num_patterns patterns
to column numbers. For each pattern P, create a column for "word1 P word2"
and another column for "word2 P word1". Thus there will be 2 × num_patterns
columns in X. This duplication of columns is examined in Section 6.6.

7. Generate a sparse matrix: Generate a matrix X in sparse matrix format, suit- 
able for input to SVDLIBC. The value for the cell in row i and column j is 
the frequency of the j-th pattern (see step 6) in phrases that contain the i-th 
word pair (see step 5). Table 9 gives some examples of pattern frequencies for
quart:volume. 
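
Steps 5 to 7 can be sketched together as below. The dictionary keyed by (row, column) is an assumed in-memory representation, not the sparse file format that SVDLIBC reads, and the `freq` callback stands in for the pattern counts gathered in earlier steps. The toy frequencies are the "in" and "of *" columns of Table 9.

```python
def build_sparse_matrix(pairs, patterns, freq):
    """Return (rows, columns, cells) with both orders of each pair and pattern."""
    rows = []
    for a, b in pairs:
        rows.append((a, b))  # row for A:B
        rows.append((b, a))  # row for B:A
    columns = []
    for p in patterns:
        columns.append((p, "forward"))   # "word1 P word2"
        columns.append((p, "backward"))  # "word2 P word1"
    cells = {}
    for i, (w1, w2) in enumerate(rows):
        for j, (p, direction) in enumerate(columns):
            first, last = (w1, w2) if direction == "forward" else (w2, w1)
            count = freq(first, p, last)
            if count:  # store only nonzero cells
                cells[(i, j)] = count
    return rows, columns, cells

# Frequencies from Table 9, keyed by (first word, pattern, last word).
TABLE_9 = {
    ("quart", "in", "volume"): 4, ("volume", "in", "quart"): 10,
    ("quart", "of *", "volume"): 5, ("volume", "of *", "quart"): 2,
}
rows, columns, cells = build_sparse_matrix(
    [("quart", "volume")], ["in", "of *"], lambda f, p, l: TABLE_9.get((f, p, l), 0)
)
```

The row for volume:quart holds the same counts as the row for quart:volume with each forward and backward column swapped, which is the symmetry that steps 5 and 6 are meant to introduce.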

8. Calculate entropy: Apply log and entropy transformations to the sparse
matrix (Landauer and Dumais, 1997). These transformations have been found to
be very helpful for information retrieval (Harman, 1986; Dumais, 1990). Let $x_{i,j}$
be the cell in row i and column j of the matrix X from step 7. Let m be the
number of rows in X and let n be the number of columns. We wish to weight
the cell $x_{i,j}$ by the entropy of the j-th column. To calculate the entropy of the
column, we need to convert the column into a vector of probabilities. Let $p_{i,j}$
be the probability of $x_{i,j}$, calculated by normalizing the column vector so that
the sum of the elements is one, $p_{i,j} = x_{i,j} / \sum_{k=1}^{m} x_{k,j}$. The entropy of the j-th
column is then $H_j = -\sum_{k=1}^{m} p_{k,j} \log(p_{k,j})$. Entropy is at its maximum when
$p_{i,j}$ is a uniform distribution, $p_{i,j} = 1/m$, in which case $H_j = \log(m)$. Entropy
is at its minimum when $p_{i,j}$ is 1 for some value of i and 0 for all other values
of i, in which case $H_j = 0$. We want to give more weight to columns
(patterns) with frequencies that vary substantially from one row (word pair) to the
next, and less weight to columns that are uniform. Therefore we weight the
cell $x_{i,j}$ by $w_j = 1 - H_j / \log(m)$, which varies from 0 when $p_{i,j}$ is uniform to
1 when entropy is minimal. We also apply the log transformation to
frequencies, $\log(x_{i,j} + 1)$. (Entropy is calculated with the original frequency values,
before the log transformation is applied.) For all i and all j, replace the
original value $x_{i,j}$ in X by the new value $w_j \log(x_{i,j} + 1)$. This is an instance of
the TF-IDF (Term Frequency-Inverse Document Frequency) family of
transformations, which is familiar in information retrieval (Salton and Buckley, 1988;
Baeza-Yates and Ribeiro-Neto, 1999): $\log(x_{i,j} + 1)$ is the TF term and $w_j$ is the
IDF term.
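
The weighting in step 8 can be checked on a small dense matrix, as in the sketch below. A dense list of lists is used only for readability (the real matrix is sparse), the function name is an assumption, and the matrix must have at least two rows so that log(m) is nonzero.

```python
import math

def log_entropy(matrix):
    """Replace each cell x[i][j] by w_j * log(x[i][j] + 1), with w_j = 1 - H_j / log(m)."""
    m = len(matrix)
    n = len(matrix[0])
    weighted = [[0.0] * n for _ in range(m)]
    for j in range(n):
        column = [matrix[i][j] for i in range(m)]
        total = sum(column)
        entropy = 0.0
        if total > 0:
            for x in column:
                if x > 0:  # terms with p = 0 contribute nothing to the entropy
                    p = x / total
                    entropy -= p * math.log(p)
        weight = 1.0 - entropy / math.log(m)
        for i in range(m):
            # Entropy uses the raw frequencies; the log is applied afterwards.
            weighted[i][j] = weight * math.log(matrix[i][j] + 1)
    return weighted

# Column 0 is uniform (weight 0); column 1 is concentrated in one row (weight 1).
w = log_entropy([[3, 5], [3, 0]])
```

The uniform column is weighted down to zero, while the column concentrated in a single row keeps its full log-transformed value, matching the two extremes described in the step.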

9. Apply SVD: After the log and entropy transformations have been applied to
the matrix X, run SVDLIBC. SVD decomposes a matrix X into a product of
three matrices $U \Sigma V^T$, where U and V are in column orthonormal form (i.e.,
the columns are orthogonal and have unit length: $U^T U = V^T V = I$) and $\Sigma$
is a diagonal matrix of singular values (hence SVD) (Golub and Van Loan, 1996).
If X is of rank r, then $\Sigma$ is also of rank r. Let $\Sigma_k$, where k < r, be the
diagonal matrix formed from the top k singular values, and let $U_k$ and $V_k$ be
the matrices produced by selecting the corresponding columns from U and V.
The matrix $U_k \Sigma_k V_k^T$ is the matrix of rank k that best approximates the
original matrix X, in the sense that it minimizes the approximation errors. That
is, $\hat{X} = U_k \Sigma_k V_k^T$ minimizes $\| \hat{X} - X \|_F$ over all matrices $\hat{X}$ of rank k, where
$\| \ldots \|_F$ denotes the Frobenius norm (Golub and Van Loan, 1996). We may think
of this matrix $U_k \Sigma_k V_k^T$ as a "smoothed" or "compressed" version of the
original matrix. In the subsequent steps, we will be calculating cosines for row
vectors. For this purpose, we can simplify calculations by dropping V. The
cosine of two vectors is their dot product, after they have been normalized to
unit length. The matrix $X X^T$ contains the dot products of all of the row
vectors. We can find the dot product of the i-th and j-th row vectors by
looking at the cell in row i, column j of the matrix $X X^T$. Since $V^T V = I$, we
have $X X^T = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma (U \Sigma)^T$, which means
that we can calculate cosines with the smaller matrix $U \Sigma$, instead of using
$X = U \Sigma V^T$ (Deerwester et al., 1990).

10. Projection: Calculate $U_k \Sigma_k$ (we use k = 300). This matrix has the same number
of rows as X, but only k columns (instead of 2 × num_patterns columns; in
our experiments, that is 300 columns instead of 8,000). We can compare two
word pairs by calculating the cosine of the corresponding row vectors in $U_k \Sigma_k$.
The row vector for each word pair has been projected from the original 8,000
dimensional space into a new 300 dimensional space. The value k = 300 is
recommended by Landauer and Dumais (1997) for measuring the attributional
similarity between words. We investigate other values in Section 6.4.
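
Steps 9 and 10 can be illustrated with a dense SVD, as in the sketch below. This swaps in NumPy's `numpy.linalg.svd` for SVDLIBC, which is only reasonable for small matrices; the 4 × 3 matrix and the function names are made up for illustration.

```python
import numpy as np

def project(X, k):
    """Return U_k Sigma_k: the rows of X projected onto the top k singular directions."""
    # Singular values come back sorted in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Multiplying by s[:k] scales each of the first k columns of U by its singular value.
    return U[:, :k] * s[:k]

def cosine(u, v):
    """Cosine of the angle between two row vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up matrix: four word pairs (rows) by three patterns (columns).
X = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])

full = project(X, 3)      # k equal to the rank: no information is lost
reduced = project(X, 2)   # k below the rank: rows are smoothed
```

When k equals the rank of X, the rows of `full` have the same pairwise dot products as the rows of X, which is the identity $X X^T = U \Sigma (U \Sigma)^T$ from step 9; with a smaller k the cosines are computed in the reduced space instead.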

11. Evaluate alternates: Let A:B and C:D be any two word pairs in the input
set. From step 2, we have (num_filter + 1) versions of A:B, the original and
num_filter alternates. Likewise, we have (num_filter + 1) versions of C:D.
Therefore we have (num_filter + 1)^2 ways to compare a version of A:B with
a version of C:D. Look for the row vectors in $U_k \Sigma_k$ that correspond to the
versions of A:B and the versions of C:D and calculate the (num_filter + 1)^2
cosines (in our experiments, there are 16 cosines). For example, suppose A:B
is quart:volume and C:D is mile:distance. Table 10 gives the cosines for the
sixteen combinations.

12. Calculate relational similarity: The relational similarity between A:B and C:D
is the average of the cosines, among the (num_filter + 1)^2 cosines from step 11,
that are greater than or equal to the cosine of the original pairs, A:B and C:D.
The requirement that the cosine must be greater than or equal to the original
cosine is a way of filtering out poor analogies, which may be introduced in step
1 and may have slipped through the filtering in step 2. Averaging the cosines,
as opposed to taking their maximum, is intended to provide some resistance to
noise. For quart:volume and mile:distance, the third column in Table 10 shows
which alternates are used to calculate the average. For these two pairs, the
average of the selected cosines is 0.677. In Table 7 we see that pumping:volume
has slipped through the filtering in step 2, although it is not a good alternate
for quart:volume. However, Table 10 shows that all four analogies that involve
pumping:volume are dropped here, in step 12.

Table 10

The sixteen combinations and their cosines. A:B::C:D expresses the analogy "A is to B as C is to
D". The third column indicates those combinations for which the cosine is greater than or equal
to the cosine of the original analogy, quart:volume::mile:distance.

Word pairs                         Cosine   Cosine ≥ original pairs
quart:volume::mile:distance        0.525    yes (original pairs)
quart:volume::feet:distance        0.464
quart:volume::mile:length          0.634    yes
quart:volume::length:distance      0.499
liter:volume::mile:distance        0.736    yes
liter:volume::feet:distance        0.687    yes
liter:volume::mile:length          0.745    yes
liter:volume::length:distance      0.576    yes
gallon:volume::mile:distance       0.763    yes
gallon:volume::feet:distance       0.710    yes
gallon:volume::mile:length         0.781    yes (highest cosine)
gallon:volume::length:distance     0.615    yes
pumping:volume::mile:distance      0.412
pumping:volume::feet:distance      0.439
pumping:volume::mile:length        0.446
pumping:volume::length:distance    0.491

Steps 11 and 12 can be repeated for each two input pairs that are to be compared. This 
completes the description of LRA. 
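
The argument in steps 9 and 10 for dropping V can be checked on a toy example. The sketch below is only an illustration in plain Python: the matrices `US` and `V` are made-up values (not data from the experiments), and the helper names are hypothetical. It builds a small matrix with orthonormal columns to play the role of $V_k$ and confirms that the row cosines of $U_k \Sigma_k V_k^T$ equal the row cosines of $U_k \Sigma_k$.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[dot(row, col) for col in zip(*b)] for row in a]

# Hypothetical U_k Sigma_k: three word pairs (rows) in a k = 2 dimensional space.
US = [[2.0, 1.0],
      [1.0, 3.0],
      [-1.0, 0.5]]

# Hypothetical V_k: three pattern columns by k = 2, with orthonormal columns,
# so that V_k^T V_k = I.
s = 1.0 / math.sqrt(2.0)
V = [[s, 0.0],
     [s, 0.0],
     [0.0, 1.0]]

# The smoothed matrix U_k Sigma_k V_k^T, back in the original pattern space.
X_smoothed = matmul(US, transpose(V))

# Each line prints the row indices and the two cosines, which agree.
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j,
              round(cosine(US[i], US[j]), 6),
              round(cosine(X_smoothed[i], X_smoothed[j]), 6))
```

In a real run the reduced matrix would come from a sparse SVD routine rather than from hand-written values; the point here is only that the projection preserves the cosines while using k columns instead of 2 × num_patterns.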

Table 11 gives the cosines for the sample SAT question. The choice pair with the
highest average cosine (the choice with the largest value in column #1), choice (b), is the 
solution for this question; LRA answers the question correctly. For comparison, column 
#2 gives the cosines for the original pairs and column #3 gives the highest cosine. For 
this particular SAT question, there is one choice that has the highest cosine for all three 
columns, choice (b), although this is not true in general. Note that the gap between the 
first choice (b) and the second choice (d) is largest for the average cosines (column #1). 
This suggests that the average of the cosines (column #1) is better at discriminating the 
correct choice than either the original cosine (column #2) or the highest cosine (column 
#3). 
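
The three scores compared above can be recomputed from the sixteen cosines in Table 10. The following Python sketch does this for the stem quart:volume and the choice mile:distance; the dictionary layout and the function name `relational_similarity` are illustrative choices, not part of the LRA implementation.

```python
# Cosines from Table 10, keyed by (version of quart:volume, version of mile:distance).
cosines = {
    ("quart:volume", "mile:distance"): 0.525,    # the original pairs
    ("quart:volume", "feet:distance"): 0.464,
    ("quart:volume", "mile:length"): 0.634,
    ("quart:volume", "length:distance"): 0.499,
    ("liter:volume", "mile:distance"): 0.736,
    ("liter:volume", "feet:distance"): 0.687,
    ("liter:volume", "mile:length"): 0.745,
    ("liter:volume", "length:distance"): 0.576,
    ("gallon:volume", "mile:distance"): 0.763,
    ("gallon:volume", "feet:distance"): 0.710,
    ("gallon:volume", "mile:length"): 0.781,
    ("gallon:volume", "length:distance"): 0.615,
    ("pumping:volume", "mile:distance"): 0.412,
    ("pumping:volume", "feet:distance"): 0.439,
    ("pumping:volume", "mile:length"): 0.446,
    ("pumping:volume", "length:distance"): 0.491,
}

def relational_similarity(all_cosines, original_cosine):
    """Step 12: average the cosines that are >= the cosine of the original pairs."""
    kept = [c for c in all_cosines if c >= original_cosine]
    return sum(kept) / len(kept)

original = cosines[("quart:volume", "mile:distance")]          # column #2
average = relational_similarity(cosines.values(), original)    # column #1
highest = max(cosines.values())                                # column #3

print(round(average, 3), original, highest)   # 0.677 0.525 0.781
```

Ten of the sixteen combinations pass the filter, and all four pumping:volume combinations are excluded, as described in step 12.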

6. Experiments with Word Analogy Questions 

This section presents various experiments with 374 multiple-choice SAT word analogy 
questions. 

6.1 Baseline LRA System 

Table 12 shows the performance of the baseline LRA system on the 374 SAT questions,
using the parameter settings and configuration described in Section 5. LRA correctly
answered 210 of the 374 questions. 160 questions were answered incorrectly and 4
questions were skipped, because the stem pair and its alternates were represented by zero
vectors. The performance of LRA is significantly better than the lexicon-based approach
of Veale (2004) (see Section 3.1) and the best performance using attributional similarity
(see Section 2.3), with 95% confidence, according to the Fisher Exact Test (Agresti, 1990).

Table 11

Cosines for the sample SAT question given in Table 5. Column #1 gives the averages of the
cosines that are greater than or equal to the original cosines (e.g., the average of the cosines that
are marked "yes" in Table 10 is 0.677; see choice (b) in column #1). Column #2 gives the cosine
for the original pairs (e.g., the cosine for the first pair in Table 10 is 0.525; see choice (b) in column
#2). Column #3 gives the maximum cosine for the sixteen possible analogies (e.g., the maximum
cosine in Table 10 is 0.781; see choice (b) in column #3).

                                   Average   Original   Highest
                                   cosines   cosines    cosines
Stem:      quart:volume            #1        #2         #3
Choices:   (a) day:night           0.374     0.327      0.443
           (b) mile:distance       0.677     0.525      0.781
           (c) decade:century      0.389     0.327      0.470
           (d) friction:heat       0.428     0.336      0.552
           (e) part:whole          0.370     0.330      0.408
Solution:  (b) mile:distance       0.677     0.525      0.781
Gap:       (b)-(d)                 0.249     0.189      0.229

Table 12

Performance of LRA on the 374 SAT questions. Precision, recall, and F are reported as
percentages. (The bottom five rows are included for comparison.)

Algorithm                          Precision   Recall   F
LRA                                56.8        56.1     56.5
Veale (2004)                       42.8        42.8     42.8
best attributional similarity      35.0        35.0     35.0
random guessing                    20.0        20.0     20.0
lowest co-occurrence frequency     16.8        16.8     16.8
highest co-occurrence frequency    11.8        11.8     11.8
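
The precision, recall, and F values reported for LRA follow from the counts of correct, incorrect, and skipped questions. The sketch below is a minimal Python illustration; the function name `scores` is hypothetical, and the definitions (precision ignores skipped questions, recall counts them as errors, F is the harmonic mean) are inferred from the reported numbers rather than quoted from the text.

```python
def scores(correct, incorrect, skipped):
    """Return (precision, recall, F) as percentages rounded to one decimal.

    Assumed definitions, consistent with the reported tables:
      precision = correct / (correct + incorrect)
      recall    = correct / (correct + incorrect + skipped)
      F         = harmonic mean of precision and recall
    """
    precision = correct / (correct + incorrect)
    recall = correct / (correct + incorrect + skipped)
    f = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 1) for x in (precision, recall, f))

lra = scores(correct=210, incorrect=160, skipped=4)
print(lra)   # (56.8, 56.1, 56.5)
```

Skipped questions lower recall but not precision, which is why the two measures differ slightly for LRA.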

As another point of reference, consider the simple strategy of always guessing the 
choice with the highest co-occurrence frequency. The idea here is that the words in 
the solution pair may occur together frequently, because there is presumably a clear 
and meaningful relation between the solution words, whereas the distractors may only 
occur together rarely, because they have no meaningful relation. This strategy is significantly
worse than random guessing. The opposite strategy, always guessing the choice
pair with the lowest co-occurrence frequency, is also worse than random guessing (but 
not significantly). It appears that the designers of the SAT questions deliberately chose 
distractors that would thwart these two strategies. 

With 374 questions and 6 word pairs per question (one stem and five choices), there 
are 2,244 pairs in the input set. In step 2, introducing alternate pairs multiplies the 
number of pairs by four, resulting in 8,976 pairs. In step 5, for each pair A:B, we add 
B:A, yielding 17,952 pairs. However, some pairs are dropped because they correspond 
to zero vectors (they do not appear together in a window of five words in the WMTS
corpus). Also, a few words do not appear in Lin's thesaurus, and some word pairs
appear twice in the SAT questions (e.g., lion:cat). The sparse matrix (step 7) has 17,232
rows (word pairs) and 8,000 columns (patterns), with a density of 5.8% (percentage of
nonzero values).

Table 13

LRA elapsed run time.

Step    Description                        Time H:M:S   Hardware
1       Find alternates                    24:56:00     1 CPU
2       Filter alternates                  0:00:02      1 CPU
3       Find phrases                       109:52:00    16 CPUs
4       Find patterns                      33:41:00     1 CPU
5       Map pairs to rows                  0:00:02      1 CPU
6       Map patterns to columns            0:00:02      1 CPU
7       Generate a sparse matrix           38:07:00     1 CPU
8       Calculate entropy                  0:11:00      1 CPU
9       Apply SVD                          0:43:28      1 CPU
10      Projection                         0:08:00      1 CPU
11      Evaluate alternates                2:11:00      1 CPU
12      Calculate relational similarity    0:00:02      1 CPU
Total                                      209:49:36

Table 14

LRA versus VSM with 374 SAT analogy questions.

Algorithm   Correct   Incorrect   Skipped   Precision   Recall   F
VSM-AV      176       193         5         47.7        47.1     47.4
VSM-WMTS    144       196         34        42.4        38.5     40.3
LRA         210       160         4         56.8        56.1     56.5

Table 13 gives the time required for each step of LRA, a total of almost nine days.
All of the steps used a single CPU on a desktop computer, except step 3, finding the 
phrases for each word pair, which used a 16 CPU Beowulf cluster. Most of the other 
steps are parallelizable; with a bit of programming effort, they could also be executed 
on the Beowulf cluster. All CPUs (both desktop and cluster) were 2.4 GHz Intel Xeons. 
The desktop computer had 2 GB of RAM and the cluster had a total of 16 GB of RAM. 
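
As a quick check of the total in Table 13, the per-step elapsed times can be summed directly. This is a small Python sketch; the dictionary simply transcribes the table, and the helper name `to_seconds` is illustrative.

```python
# Elapsed time per step (H:M:S), transcribed from Table 13.
step_times = {
    1: "24:56:00", 2: "0:00:02", 3: "109:52:00", 4: "33:41:00",
    5: "0:00:02", 6: "0:00:02", 7: "38:07:00", 8: "0:11:00",
    9: "0:43:28", 10: "0:08:00", 11: "2:11:00", 12: "0:00:02",
}

def to_seconds(hms):
    """Convert an 'H:M:S' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

total_seconds = sum(to_seconds(t) for t in step_times.values())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
total_hms = f"{hours}:{minutes:02d}:{seconds:02d}"
total_days = total_seconds / 86400

print(total_hms, round(total_days, 2))   # 209:49:36 8.74
```

Note that this is elapsed time, not CPU time: step 3 accounts for more than half of the total even though it ran on 16 CPUs.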

6.2 LRA versus VSM 

Table 14 compares LRA to the Vector Space Model with the 374 analogy questions. VSM-AV
refers to the VSM using AltaVista's database as a corpus. The VSM-AV results are
taken from Turney and Littman (2005). As mentioned earlier, we estimate this
corpus contained about $5 \times 10^{11}$ English words at the time the VSM-AV experiments
took place. VSM-WMTS refers to the VSM using the WMTS, which contains about
$5 \times 10^{10}$ English words. We generated the VSM-WMTS results by adapting the VSM to the
WMTS. The algorithm is slightly different from Turney and Littman (2005), because we
used passage frequencies instead of document frequencies.

Table 15

Comparison with human SAT performance. The last column in the table indicates whether (YES)
or not (NO) the average human performance (57%) falls within the 95% confidence interval of
the corresponding algorithm's performance. The confidence intervals are calculated using the
Binomial Exact Test (Agresti, 1990).

System      Recall        95% confidence        Human-level
            (% correct)   interval for recall   (57%)
VSM-AV      47.1          42.2-52.5             NO
VSM-WMTS    38.5          33.5-43.6             NO
LRA         56.1          51.0-61.2             YES

All three pairwise differences in recall in Table 14 are statistically significant with
95% confidence, using the Fisher Exact Test (Agresti, 1990). The pairwise differences in
precision between LRA and the two VSM variations are also significant, but the difference
in precision between the two VSM variations (42.4% versus 47.7%) is not significant.
Although VSM-AV has a corpus ten times larger than LRA's, LRA still performs
better than VSM-AV.
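
To make the significance claims concrete, the following Python sketch computes a Fisher exact p-value for a 2 × 2 table using only the standard library. The text does not say whether the test was one-sided or two-sided, nor exactly how the tables were laid out, so this sketch assumes a two-sided test on (correct, not correct) counts out of 374 for each system; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Correct answers out of 374 questions, from Table 14.
correct = {"LRA": 210, "VSM-AV": 176, "VSM-WMTS": 144}
n_questions = 374

p_values = {}
for first, second in [("LRA", "VSM-AV"), ("LRA", "VSM-WMTS"), ("VSM-AV", "VSM-WMTS")]:
    a, c = correct[first], correct[second]
    p_values[(first, second)] = fisher_exact_two_sided(a, n_questions - a,
                                                       c, n_questions - c)
    print(first, "vs", second, p_values[(first, second)])
```

Under these assumptions, a pairwise difference in recall is significant at the 95% level when its p-value falls below 0.05.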

Comparing VSM-AV to VSM-WMTS, the smaller corpus has reduced the score of 
the VSM, but much of the drop is due to the larger number of questions that were 
skipped (34 for VSM-WMTS versus 5 for VSM-AV). With the smaller corpus, many more 
of the input word pairs simply do not appear together in short phrases in the corpus. 
LRA is able to answer as many questions as VSM-AV, although it uses the same corpus 
as VSM-WMTS, because Lin's thesaurus allows LRA to substitute synonyms for words 
that are not in the corpus. 

VSM-AV required 17 days to process the 374 analogy questions (Turney and Littman, 2005),
compared to 9 days for LRA. As a courtesy to AltaVista, Turney and Littman (2005)
inserted a five second delay between each query. Since the WMTS is running locally, there
is no need for delays. VSM-WMTS processed the questions in only one day.

6.3 Human Performance 

The average performance of college-bound senior high school students on verbal SAT 
questions corresponds to a recall (percent correct) of about 57% (Turney and Littman, 2005).
The SAT I test consists of 78 verbal questions and 60 math questions (there is also an SAT 
II test, covering specific subjects, such as chemistry). Analogy questions are only a sub- 
set of the 78 verbal SAT questions. If we assume that the difficulty of our 374 analogy 
questions is comparable to the difficulty of the 78 verbal SAT I questions, then we can 
estimate that the average college-bound senior would correctly answer about 57% of 
the 374 analogy questions. 

Of our 374 SAT questions, 190 are from a collection of ten official SAT tests (Claman, 2000).
On this subset of the questions, LRA has a recall of 61.1%, compared to a recall of 51.1%
on the other 184 questions. The 184 questions that are not from Claman (2000) seem to
be more difficult. This indicates that we may be underestimating how well LRA
performs, relative to college-bound senior high school students. Claman (2000) suggests
that the analogy questions may be somewhat harder than other verbal SAT questions,
so we may be slightly overestimating the mean human score on the analogy questions.

Table 15 gives the 95% confidence intervals for LRA, VSM-AV, and VSM-WMTS,
calculated by the Binomial Exact Test (Agresti, 1990). There is no significant difference
between LRA and human performance, but VSM-AV and VSM-WMTS are significantly 
below human-level performance. 
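
The comparison with human performance can be sketched in Python as follows. The text refers to the Binomial Exact Test without giving a formula; this sketch assumes the usual exact (Clopper-Pearson) interval and finds its endpoints by bisection on the binomial distribution function. The function names are hypothetical, and the computed endpoints may differ from the published intervals in the last digit.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_decreasing(f, target, tol=1e-9):
    """Find p in [0, 1] with f(p) = target, for f decreasing in p, by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion k / n."""
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

human_recall = 0.57
correct = {"VSM-AV": 176, "VSM-WMTS": 144, "LRA": 210}   # out of 374 questions
intervals = {name: clopper_pearson(k, 374) for name, k in correct.items()}

for name, (lower, upper) in intervals.items():
    human_level = "YES" if lower <= human_recall <= upper else "NO"
    print(name, round(100 * lower, 1), round(100 * upper, 1), human_level)
```

Only the interval for LRA contains the estimated human recall of 57%, which is the sense in which LRA is not significantly different from human-level performance.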






Computational Linguistics 



Volume 1, Number 1 



Table 16

Variation in performance with different parameter values. The Baseline column marks the
baseline parameter values. The Step column gives the step number in Section 5.5 where each
parameter is discussed.

Parameter      Baseline   Value   Step   Precision   Recall   F
num_sim                   5       1      54.2        53.5     53.8
num_sim        =>         10      1      56.8        56.1     56.5
num_sim                   15      1      54.1        53.5     53.8
max_phrase                4       2      55.8        55.1     55.5
max_phrase     =>         5       2      56.8        56.1     56.5
max_phrase                6       2      56.2        55.6     55.9
num_filter                1       2      54.3        53.7     54.0
num_filter                2       2      55.7        55.1     55.4
num_filter     =>         3       2      56.8        56.1     56.5
num_filter                4       2      55.7        55.1     55.4
num_filter                5       2      54.3        53.7     54.0
num_patterns              1000    4      55.9        55.3     55.6
num_patterns              2000    4      57.6        57.0     57.3
num_patterns              3000    4      58.4        57.8     58.1
num_patterns   =>         4000    4      56.8        56.1     56.5
num_patterns              5000    4      57.0        56.4     56.7
num_patterns              6000    4      57.0        56.4     56.7
num_patterns              7000    4      58.1        57.5     57.8
k                         100     10     55.7        55.1     55.4
k              =>         300     10     56.8        56.1     56.5
k                         500     10     57.6        57.0     57.3
k                         700     10     56.5        55.9     56.2
k                         900     10     56.2        55.6     55.9


6.4 Varying the Parameters in LRA 

There are several parameters in the LRA algorithm (see Section 5.5). The parameter
values were determined by trying a small number of possible values on a small set of
questions that were set aside. Since LRA is intended to be an unsupervised learning
algorithm, we did not attempt to tune the parameter values to maximize the precision
and recall on the 374 SAT questions. We hypothesized that LRA is relatively insensitive
to the values of the parameters.

Table 16 shows the variation in the performance of LRA as the parameter values
are adjusted. We take the baseline parameter settings (given in Section 5.5) and vary
each parameter, one at a time, while holding the remaining parameters fixed at their
baseline values. None of the precision and recall values are significantly different from
the baseline, according to the Fisher Exact Test (Agresti, 1990), at the 95% confidence
level. This supports the hypothesis that the algorithm is not sensitive to the parameter
values.

Although a full run of LRA on the 374 SAT questions takes nine days, for some
of the parameters it is possible to reuse cached data from previous runs. We limited
the experiments with num_sim and max_phrase because caching was not as helpful for
these parameters, so experimenting with them required several weeks.

Table 17

Results of ablation experiments.

            LRA                                LRA
            baseline   LRA      LRA            no SVD
            system     no SVD   no synonyms    no synonyms   VSM-WMTS
            #1         #2       #3             #4            #5
Correct     210        198      185            178           144
Incorrect   160        172      167            173           196
Skipped     4          4        22             23            34
Precision   56.8       53.5     52.6           50.7          42.4
Recall      56.1       52.9     49.5           47.6          38.5
F           56.5       53.2     51.0           49.1          40.3

6.5 Ablation Experiments 

As mentioned in the introduction, LRA extends the VSM approach of Turney and Littman (2005)
by (1) exploring variations on the analogies by replacing words with synonyms (step 1),
(2) automatically generating connecting patterns (step 4), and (3) smoothing the data
with SVD (step 9). In this subsection, we ablate each of these three components to
assess their contribution to the performance of LRA. Table 17 shows the results.

Without SVD (compare column #1 to #2 in Table 17), performance drops, but the
drop is not statistically significant with 95% confidence, according to the Fisher Exact
Test (Agresti, 1990). However, we hypothesize that the drop in performance would
be significant with a larger set of word pairs. More word pairs would increase the
sample size, which would narrow the 95% confidence interval, which would likely
show that SVD is making a significant contribution. Furthermore, more word pairs
would increase the matrix size, which would give SVD more leverage. For example,
Landauer and Dumais (1997) apply SVD to a matrix of 30,473 columns by 60,768
rows, but our matrix here is 8,000 columns by 17,232 rows. We are currently
gathering more SAT questions, to test this hypothesis.

Without synonyms (compare column #1 to #3 in Table 17), recall drops significantly
(from 56.1% to 49.5%), but the drop in precision is not significant. When the synonym
component is dropped, the number of skipped questions rises from 4 to 22, which
demonstrates the value of the synonym component of LRA for compensating for sparse
data.

When both SVD and synonyms are dropped (compare column #1 to #4 in Table 17),
the decrease in recall is significant, but the decrease in precision is not significant. Again,
we believe that a larger sample size would show the drop in precision is significant.

If we eliminate both synonyms and SVD from LRA, all that distinguishes LRA from
VSM-WMTS is the patterns (step 4). The VSM approach uses a fixed list of 64 patterns
to generate 128 dimensional vectors (Turney and Littman, 2005), whereas LRA uses a
dynamically generated set of 4,000 patterns, resulting in 8,000 dimensional vectors. We
can see the value of the automatically generated patterns by comparing LRA without
synonyms and SVD (column #4) to VSM-WMTS (column #5). The difference in both
precision and recall is statistically significant with 95% confidence, according to the Fisher
Exact Test (Agresti, 1990).

The ablation experiments support the value of the patterns (step 4) and synonyms
(step 1) in LRA, but the contribution of SVD (step 9) has not been proven, although
we believe more data will support its effectiveness. Nonetheless, the three components
together result in a 16% absolute increase in F (from 40.3% to 56.5%; compare #1 to #5).

6.6 Matrix Symmetry 

We know a priori that, if A:B::C:D, then B:A::D:C. For example, "mason is to stone as
carpenter is to wood" implies "stone is to mason as wood is to carpenter". Therefore a
good measure of relational similarity, $\mathrm{sim}_r$, should obey the following equation:

    $\mathrm{sim}_r(A{:}B, C{:}D) = \mathrm{sim}_r(B{:}A, D{:}C)$    (8)

In steps 5 and 6 of the LRA algorithm (Section 5.5), we ensure that the matrix X is
symmetrical, so that equation (8) is necessarily true for LRA. The matrix is designed so
that the row vector for A:B is different from the row vector for B:A only by a permutation
of the elements. The same permutation distinguishes the row vectors for C:D and D:C.
Therefore the cosine of the angle between A:B and C:D must be identical to the cosine
of the angle between B:A and D:C (see the definition of the cosine).
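
The permutation argument can be illustrated with a small Python sketch. The column layout here is an assumption made for the example (the first half of each row holds the "word1 P word2" frequencies and the second half the "word2 P word1" frequencies); the row values and the helper name `reverse_pair` are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reverse_pair(row):
    """Row vector for B:A, given the row vector for A:B.

    Assumes columns 0..n-1 hold the "word1 P word2" values and columns
    n..2n-1 hold the "word2 P word1" values, so reversing the pair swaps
    the two halves of the row.
    """
    n = len(row) // 2
    return row[n:] + row[:n]

# Hypothetical rows for A:B and C:D with n = 3 patterns (6 columns).
ab = [3.0, 0.0, 1.0, 0.5, 2.0, 0.0]
cd = [2.0, 1.0, 0.0, 0.0, 1.5, 0.5]
ba, dc = reverse_pair(ab), reverse_pair(cd)

# The same permutation applied to both rows leaves the cosine unchanged.
print(cosine(ab, cd), cosine(ba, dc))
# Reversing only one of the two pairs generally changes the cosine.
print(cosine(ba, cd))
```

Applying the same permutation to both vectors reorders the terms of the dot product and of each norm without changing their values, which is why the first two cosines agree.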

To discover the consequences of this design decision, we altered steps 5 and 6 so 
that symmetry is no longer preserved. In step 5, for each word pair A:B that appears in 
the input set, we only have one row. There is no row for B:A unless B:A also appears in 
the input set. Thus the number of rows in the matrix dropped from 17,232 to 8,616. 

In step 6, we no longer have two columns for each pattern P, one for "word1 P word2"
and another for "word2 P word1". However, to be fair, we kept the total number of
columns at 8,000. In step 4, we selected the top 8,000 patterns (instead of the top 4,000),
distinguishing the pattern "word1 P word2" from the pattern "word2 P word1"
(instead of considering them equivalent). Thus a pattern P with a high frequency is likely
to appear in two columns, in both possible orders, but a lower frequency pattern might
appear in only one column, in only one possible order.

These changes resulted in a slight decrease in performance. Recall dropped from
56.1% to 55.3% and precision dropped from 56.8% to 55.9%. The decrease is not
statistically significant. However, the modified algorithm no longer obeys equation (8).
Although dropping symmetry appears to cause no significant harm to the performance
of the algorithm on the SAT questions, we prefer to retain symmetry, to ensure that
equation (8) is satisfied.

Note that, if A:B::C:D, it does not follow that B:A::C:D. For example, it is false that
"stone is to mason as carpenter is to wood". In general (except when the semantic
relations between A and B are symmetrical), we have the following inequality:

    $\mathrm{sim}_r(A{:}B, C{:}D) \neq \mathrm{sim}_r(B{:}A, C{:}D)$    (9)

Therefore we do not want A:B and B:A to be represented by identical row vectors,
although it would ensure that equation (8) is satisfied.

6.7 All Alternates versus Better Alternates 

In step 12 of LRA, the relational similarity between A:B and C:D is the average of the
cosines, among the (num_filter + 1)^2 cosines from step 11, that are greater than or equal
to the cosine of the original pairs, A:B and C:D. That is, the average includes only those
alternates that are "better" than the originals. Taking all alternates instead of the better
alternates, recall drops from 56.1% to 40.4% and precision drops from 56.8% to 40.8%.
Both decreases are statistically significant with 95% confidence, according to the Fisher
Exact Test (Agresti, 1990).

Table 18

Performance as a function of N.

N       Correct   Incorrect   Skipped   Precision   Recall   F
1       114       179         81        38.9        30.5     34.2
3       146       206         22        41.5        39.0     40.2
10      167       201         6         45.4        44.7     45.0
30      174       196         4         47.0        46.5     46.8
100     178       192         4         48.1        47.6     47.8
300     192       178         4         51.9        51.3     51.6
1000    198       172         4         53.5        52.9     53.2
3000    207       163         4         55.9        55.3     55.6

6.8 Interpreting Vectors 

Suppose a word pair A:B corresponds to a vector r in the matrix X. It would be con- 
venient if inspection of r gave us a simple explanation or description of the relation 
between A and B. For example, suppose the word pair ostrich:bird maps to the row 
vector r. It would be pleasing to look in r and find that the largest element corresponds 
to the pattern "is the largest" (i.e., "ostrich is the largest bird"). Unfortunately, inspec- 
tion of r reveals no such convenient patterns. 

We hypothesize that the semantic content of a vector is distributed over the whole
vector; it is not concentrated in a few elements. To test this hypothesis, we modified
step 10 of LRA. Instead of projecting the 8,000 dimensional vectors into the 300
dimensional space $U_k \Sigma_k$, we use the matrix $U_k \Sigma_k V_k^T$. This matrix yields the same cosines
as $U_k \Sigma_k$, but preserves the original 8,000 dimensions, making it easier to interpret the
row vectors. For each row vector in $U_k \Sigma_k V_k^T$, we select the N largest values and set
all other values to zero. The idea here is that we will only pay attention to the N most
important patterns in r; the remaining patterns will be ignored. This reduces the length
of the row vectors, but the cosine is the dot product of normalized vectors (all vectors
are normalized to unit length; see the definition of the cosine), so the change to the
vector lengths has no impact; only the angle of the vectors is important. If most of the
semantic content is in the N largest elements of r, then setting the remaining elements
to zero should have relatively little impact.
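
The truncation used in this experiment can be sketched in a few lines of Python. The example vectors are invented, the function name `keep_top_n` is hypothetical, and "largest" is taken here to mean largest by signed value, which is one reading of the description above.

```python
import math

def keep_top_n(row, n):
    """Keep the n largest values of a row vector and set all other values to zero."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    keep = set(ranked[:n])
    return [value if i in keep else 0.0 for i, value in enumerate(row)]

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

# Hypothetical row vectors for two word pairs over five patterns.
r1 = [0.9, 0.1, 0.4, 0.05, 0.3]
r2 = [0.8, 0.2, 0.1, 0.6, 0.3]

for n in (1, 3, 5):
    truncated_cosine = cosine(keep_top_n(r1, n), keep_top_n(r2, n))
    print(n, round(truncated_cosine, 3))
```

If the semantic content were concentrated in a few elements, the cosine for small N would already be close to the cosine for the full vectors; the experiment measures how large N must be before that happens.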

Table 18 shows the performance as N varies from 1 to 3,000. The precision and recall
are significantly below the baseline LRA until N ≥ 300 (95% confidence, Fisher Exact
Test). In other words, for a typical SAT analogy question, we need to examine the top 
300 patterns to explain why LRA selected one choice instead of another. 

We are currently working on an extension of LRA that will explain with a single 
pattern why one choice is better than another. We have had some promising results, but 
this work is not yet mature. However, we can confidently claim that interpreting the 
vectors is not trivial. 



6.9 Manual Patterns versus Automatic Patterns 

Turney and Littman (2005) used 64 manually generated patterns whereas LRA uses 4,000
automatically generated patterns. We know from Section 6.5 that the automatically
generated patterns are significantly better than the manually generated patterns. It may be
interesting to see how many of the manually generated patterns appear within the
automatically generated patterns. If we require an exact match, 50 of the 64 manual patterns
can be found in the automatic patterns. If we are lenient about wildcards, and count
the pattern "not the" as matching "* not the" (for example), then 60 of the 64 manual
patterns appear within the automatic patterns. This suggests that the improvement in
performance with the automatic patterns is due to the increased quantity of patterns,
rather than a qualitative difference in the patterns.

Turney and Littman (2005) point out that some of their 64 patterns have been used
by other researchers. For example, Hearst (1992) used the pattern "such as" to
discover hyponyms and Berland and Charniak (1999) used the pattern "of the" to discover
meronyms. Both of these patterns are included in the 4,000 patterns automatically
generated by LRA.

The novelty in Turney and Littman (2005) is that their patterns are not used to mine
text for instances of word pairs that fit the patterns (Hearst, 1992; Berland and Charniak, 1999);
instead, they are used to gather frequency data for building vectors that represent the
relation between a given pair of words. The results in Section 6.8 show that a vector
contains more information than any single pattern or small set of patterns; a vector is a
distributed representation. LRA is distinct from Hearst (1992) and Berland and Charniak (1999)
in its focus on distributed representations, which it shares with Turney and Littman (2005),
but LRA goes beyond Turney and Littman (2005) by finding patterns automatically.
Riloff and Jones (1999) and Yangarber (2003) also find patterns automatically, but
their goal is to mine text for instances of word pairs; the same goal as Hearst (1992)
and Berland and Charniak (1999). Because LRA uses patterns to build distributed vector
representations, it can exploit patterns that would be much too noisy and unreliable
for the kind of text mining instance extraction that is the objective of Hearst (1992),
Berland and Charniak (1999), Riloff and Jones (1999), and Yangarber (2003). Therefore
LRA can simply select the highest frequency patterns (step 4 in Section 5.5); it does not
need the more sophisticated selection algorithms of Riloff and Jones (1999) and Yangarber (2003).

7. Experiments with Noun-Modifier Relations 

This section describes experiments with 600 noun-modifier pairs, hand-labeled with 30
classes of semantic relations (Nastase and Szpakowicz, 2003). In the following
experiments, LRA is used with the baseline parameter values, exactly as described in
Section 5.5. No adjustments were made to tune LRA to the noun-modifier pairs. LRA is
used as a distance (nearness) measure in a single nearest neighbour supervised learning
algorithm.

7.1 Classes of Relations 

The following experiments use the 600 labeled noun-modifier pairs of Nastase and Szpakowicz (2003).
This data set includes information about the part of speech and WordNet synset
(synonym set; i.e., word sense tag) of each word, but our algorithm does not use this
information.

Table 19 lists the 30 classes of semantic relations. The table is based on Appendix A
of Nastase and Szpakowicz (2003), with some simplifications. The original table listed
several semantic relations for which there were no instances in the data set. These were
relations that are typically expressed with longer phrases (three or more words), rather
than noun-modifier word pairs. For clarity, we decided not to include these relations in
Table 19.

In this table, H represents the head noun and M represents the modifier. For
example, in "flu virus", the head noun (H) is "virus" and the modifier (M) is "flu" (*). In
English, the modifier (typically a noun or adjective) usually precedes the head noun. In
the description of purpose, V represents an arbitrary verb. In "concert hall", the hall is
for presenting concerts (V is "present") or holding concerts (V is "hold") (†).

Nastase and Szpakowicz (2003) organized the relations into groups. The five
capitalized terms in the "Relation" column of Table 19 are the names of five groups of
semantic relations. (The original table had a sixth group, but there are no examples of
this group in the data set.) We make use of this grouping in the following experiments.



7.2 Baseline LRA with Single Nearest Neighbour 

The following experiments use single nearest neighbour classification with leave-one- 
out cross-validation. For leave-one-out cross-validation, the testing set consists of a sin- 
gle noun-modifier pair and the training set consists of the 599 remaining noun-modifiers. 
The data set is split 600 times, so that each noun-modifier gets a turn as the testing word 
pair. The predicted class of the testing pair is the class of the single nearest neighbour 
in the training set. As the measure of nearness, we use LRA to calculate the relational 
similarity between the testing pair and the training pairs. The single nearest neighbour 
algorithm is a supervised learning algorithm (i.e., it requires a training set of labeled 
data), but we are using LRA to measure the distance between a pair and its potential 
neighbours, and LRA is itself determined in an unsupervised fashion (i.e., LRA does 
not need labeled data). 
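
The evaluation procedure just described can be sketched as follows in Python. The similarity function passed in stands for LRA's relational similarity; in this toy example it is replaced by a plain cosine over invented two-dimensional vectors, and the labels and function names are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors (stand-in similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def leave_one_out_1nn(items, labels, similarity):
    """Predict each item's class from its single nearest neighbour among the others."""
    predictions = []
    for i, item in enumerate(items):
        # The training set is every item except the one being tested.
        candidates = [j for j in range(len(items)) if j != i]
        nearest = max(candidates, key=lambda j: similarity(item, items[j]))
        predictions.append(labels[nearest])
    return predictions

# Hypothetical vectors standing in for four labeled noun-modifier pairs.
items = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["cause", "cause", "purpose", "purpose"]

predictions = leave_one_out_1nn(items, labels, cosine)
accuracy = sum(p == g for p, g in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Only the choice of nearest neighbour depends on the similarity measure, so any unsupervised measure of relational similarity can be substituted for the stand-in cosine.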

Each SAT question has five choices, so answering 374 SAT questions required
calculating 374 × 5 × 16 = 29,920 cosines. The factor of 16 comes from the alternate pairs,
step 11 in LRA. With the noun-modifier pairs, using leave-one-out cross-validation, each
test pair has 599 choices, so an exhaustive application of LRA would require
calculating 600 × 599 × 16 = 5,750,400 cosines. To reduce the amount of computation
required, we first find the 30 nearest neighbours for each pair, ignoring the alternate pairs
(600 × 599 = 359,400 cosines), and then apply the full LRA, including the alternates, to
just those 30 neighbours (600 × 30 × 16 = 288,000 cosines), which requires calculating
only 359,400 + 288,000 = 647,400 cosines.
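
The cosine counts above can be reproduced with a few lines of Python arithmetic. The variable names are illustrative; the factor of 16 is (num_filter + 1)^2 with num_filter = 3.

```python
versions = (3 + 1) ** 2                    # 16 combinations of alternates per comparison

sat_cosines = 374 * 5 * versions           # SAT: five choices per question
exhaustive = 600 * 599 * versions          # full LRA on every test/training pair
first_pass = 600 * 599                     # originals only, to find 30 nearest neighbours
second_pass = 600 * 30 * versions          # full LRA on the 30 neighbours of each pair
two_stage = first_pass + second_pass

print(sat_cosines, exhaustive, two_stage)  # 29920 5750400 647400
print(round(exhaustive / two_stage, 1))    # 8.9
```

The two-stage procedure therefore needs roughly one ninth of the cosines of the exhaustive approach, at the cost of possibly missing a neighbour that only ranks highly once the alternates are considered.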

There are 600 word pairs in the input set for LRA. In step 2, introducing alternate 
pairs multiplies the number of pairs by four, resulting in 2,400 pairs. In step 5, for each 
pair A:B, we add B:A, yielding 4,800 pairs. However, some pairs are dropped because 
they correspond to zero vectors and a few words do not appear in Lin's thesaurus. The 
sparse matrix (step 7) has 4,748 rows and 8,000 columns, with a density of 8.4%. 

Following Turney and Littman (2005), we evaluate the performance by accuracy
and also by the macroaveraged F measure (Lewis, 1991). Macroaveraging calculates
the precision, recall, and F for each class separately, and then calculates the average 
across all classes. Microaveraging combines the true positive, false positive, and false 
negative counts for all of the classes, and then calculates precision, recall, and F from 
the combined counts. Macroaveraging gives equal weight to all classes, but microaver- 
aging gives more weight to larger classes. We use macroaveraging (giving equal weight 
to all classes), because we have no reason to believe that the class sizes in the data set 
reflect the actual distribution of the classes in a real corpus. 
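
The difference between the two averaging schemes can be made concrete with a short sketch. The function names and the toy labels are illustrative assumptions; undefined precision or recall (a zero denominator) is treated as 0.0 here, which is one common convention and not necessarily the one used in the experiments.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F from raw counts (0.0 where undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def macro_micro_f(
    gold: Sequence[Hashable], predicted: Sequence[Hashable]
) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but wrongly
            fn[g] += 1  # g was missed
    classes = set(gold) | set(predicted)
    # Macroaverage: F for each class, then an unweighted mean over classes.
    macro_f = sum(prf(tp[c], fp[c], fn[c])[2] for c in classes) / len(classes)
    # Microaverage: pool the counts over all classes, then compute F once.
    micro_f = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    return macro_f, micro_f


# Toy example: a large class "a" and a small class "b" that is never predicted.
gold = ["a"] * 8 + ["b"] * 2
predicted = ["a"] * 10
macro_f, micro_f = macro_micro_f(gold, predicted)
```

In this toy case the microaveraged F is 0.8 while the macroaveraged F is about 0.44, because the small class that is never predicted counts as much as the large one under macroaveraging.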

Classification with 30 distinct classes is a hard problem. To make the task easier, we 
can collapse the 30 classes to 5 classes, using the grouping that is given in Table 19. For
example, agent and beneficiary both collapse to participant. On the 30 class problem, LRA 
with the single nearest neighbour algorithm achieves an accuracy of 39.8% (239/600) 
and a macroaveraged F of 36.6%. Always guessing the majority class would result in 
an accuracy of 8.2% (49/600). On the 5 class problem, the accuracy is 58.0% (348/600) 
and the macroaveraged F is 54.6%. Always guessing the majority class would give an 
accuracy of 43.3% (260/600). For both the 30 class and 5 class problems, LRA's accuracy 
is significantly higher than guessing the majority class, with 95% confidence, according 
to the Fisher Exact Test (Agresti, 1990).
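
The collapsing of classes and the majority-class baseline can be sketched as follows. The grouping is the one in Table 19, keyed by the relation abbreviations; the function names and the six toy labels are assumptions for illustration and are not the real data set.

```python
from collections import Counter
from typing import List, Sequence

# Grouping of the 30 relation abbreviations into 5 classes, as in Table 19.
GROUPS = {
    "causality": ["cs", "eff", "prp", "detr"],
    "temporality": ["freq", "tat", "tthr"],
    "spatial": ["dir", "loc", "lat", "lfr"],
    "participant": ["ag", "ben", "inst", "obj", "obj_prop", "part",
                    "posr", "prop", "prod", "src", "st", "whl"],
    "quality": ["cntr", "cont", "eq", "mat", "meas", "top", "type"],
}
COLLAPSE = {abbr: group for group, abbrs in GROUPS.items() for abbr in abbrs}


def collapse(labels: Sequence[str]) -> List[str]:
    """Map fine-grained relation abbreviations to their coarse class."""
    return [COLLAPSE[label] for label in labels]


def majority_baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy obtained by always guessing the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy labels, not the real data set.
fine = ["ag", "ben", "cs", "ag", "mat", "obj"]
coarse = collapse(fine)
fine_baseline = majority_baseline_accuracy(fine)
coarse_baseline = majority_baseline_accuracy(coarse)
```

Collapsing raises the majority baseline as well as the classifier's accuracy, which is why the reported baselines differ so much between the two problems (49/600 for 30 classes, 260/600 for 5 classes).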

Table 19
Classes of semantic relations, from Nastase and Szpakowicz (2003).

Relation          Abbr.     Example phrase    Description

Causality
  cause           cs        flu virus (*)     H makes M occur or exist, H is necessary and sufficient
  effect          eff       exam anxiety      M makes H occur or exist, M is necessary and sufficient
  purpose         prp       concert hall (†)  H is for V-ing M, M does not necessarily occur or exist
  detraction      detr      headache pill     H opposes M, H is not sufficient to prevent M

Temporality
  frequency       freq      daily exercise    H occurs every time M occurs
  time at         tat       morning exercise  H occurs when M occurs
  time through    tthr      six-hour meeting  H existed while M existed, M is an interval of time

Spatial
  direction       dir       outgoing mail     H is directed towards M, M is not the final point
  location        loc       home town         H is the location of M
  location at     lat       desert storm      H is located at M
  location from   lfr       foreign capital   H originates at M

Participant
  agent           ag        student protest   M performs H, M is animate or natural phenomenon
  beneficiary     ben       student discount  M benefits from H
  instrument      inst      laser printer     H uses M
  object          obj       metal separator   M is acted upon by H
  object property obj_prop  sunken ship       H underwent M
  part            part      printer tray      H is part of M
  possessor       posr      national debt     M has H
  property        prop      blue book         H is M
  product         prod      plum tree         H produces M
  source          src       olive oil         M is the source of H
  stative         st        sleeping dog      H is in a state of M
  whole           whl       daisy chain       M is part of H

Quality
  container       cntr      film music        M contains H
  content         cont      apple cake        M is contained in H
  equative        eq        player coach      H is also M
  material        mat       brick house       H is made of M
  measure         meas      expensive book    M is a measure of H
  topic           top       weather report    H is concerned with M
  type            type      oak tree          M is a type of H

Table 20
Comparison of LRA and VSM on the 30 class problem.

                VSM-AV    VSM-WMTS  LRA
Correct         167       148       239
Incorrect       433       452       361
Total           600       600       600
Accuracy (%)    27.8      24.7      39.8
Precision (%)   27.9      24.0      41.0
Recall (%)      26.8      20.9      35.9
F (%)           26.5      20.3      36.6




Table 21
Comparison of LRA and VSM on the 5 class problem.

                VSM-AV    VSM-WMTS  LRA
Correct         274       264       348
Incorrect       326       336       252
Total           600       600       600
Accuracy (%)    45.7      44.0      58.0
Precision (%)   43.4      40.2      55.9
Recall (%)      43.1      41.4      53.6
F (%)           43.2      40.6      54.6



7.3 LRA versus VSM 

Table 20 shows the performance of LRA and VSM on the 30 class problem. VSM-AV is
VSM with the AltaVista corpus and VSM-WMTS is VSM with the WMTS corpus. The
results for VSM-AV are taken from Turney and Littman (2005). All three pairwise
differences in the three F measures are statistically significant at the 95% level, according to
the Paired T-Test (Feelders and Verkooijen, 1995). The accuracy of LRA is significantly
higher than the accuracies of VSM-AV and VSM-WMTS, according to the Fisher Exact
Test (Agresti, 1990), but the difference between the two VSM accuracies is not
significant.
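
The accuracy comparisons can be reproduced approximately from the correct and incorrect counts in Table 20. The sketch below implements a two-sided Fisher exact test with the standard library only. The two-sided formulation (summing the probabilities of all tables no more likely than the observed one) is an assumption, since the text does not say which variant was used, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * tolerance)


# (correct, incorrect) counts on the 30 class problem, from Table 20.
vsm_av, vsm_wmts, lra = (167, 433), (148, 452), (239, 361)

p_lra_vs_av = fisher_exact_two_sided(*lra, *vsm_av)
p_lra_vs_wmts = fisher_exact_two_sided(*lra, *vsm_wmts)
p_av_vs_wmts = fisher_exact_two_sided(*vsm_av, *vsm_wmts)
```

Under this formulation, both comparisons involving LRA fall below the 0.05 threshold and the comparison between the two VSM variants does not, which agrees with the pattern reported above.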

Table 21 compares the performance of LRA and VSM on the 5 class problem. The
accuracy and F measure of LRA are significantly higher than the accuracies and F mea- 
sures of VSM-AV and VSM-WMTS, but the differences between the two VSM accuracies 
and F measures are not significant. 

8. Discussion 

The experimental results in Sections 6 and 7 demonstrate that LRA performs
significantly better than the VSM, but it is also clear that there is room for improvement. The
accuracy might not yet be adequate for practical applications, although past work has
shown that it is possible to adjust the tradeoff of precision versus recall (Turney and Littman, 2005).
For some of the applications, such as information extraction, LRA might be suitable if it
is adjusted for high precision, at the expense of low recall.
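
One simple way to trade recall for precision is to leave an item unanswered when the system's confidence is low. The sketch below assumes this abstention scheme purely for illustration; it is not necessarily the procedure of Turney and Littman (2005), and the function name, the confidence values, and the outcomes are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_recall_with_abstention(
    results: Sequence[Tuple[float, bool]], threshold: float
) -> Tuple[Optional[float], float]:
    """results holds (confidence, is_correct) for each test item.

    Items whose confidence is below the threshold are left unanswered.
    Precision = correct / answered; recall = correct / all items.
    """
    answered = [ok for conf, ok in results if conf >= threshold]
    correct = sum(answered)
    precision = correct / len(answered) if answered else None
    recall = correct / len(results)
    return precision, recall


# Hypothetical confidences (for example, the cosine of the best match).
results = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False)]
answer_all = precision_recall_with_abstention(results, threshold=0.0)
cautious = precision_recall_with_abstention(results, threshold=0.7)
```

On this toy data, raising the threshold from 0.0 to 0.7 lifts precision from 0.6 to 1.0 while recall drops from 0.6 to 0.4.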

Another limitation is speed; it took almost nine days for LRA to answer 374 analogy 
questions. However, with progress in computer hardware, speed will gradually become 
less of a concern. Also, the software has not been optimized for speed; there are several
places where the efficiency could be increased and many operations are parallelizable. 
It may also be possible to precompute much of the information for LRA, although this 
would require substantial changes to the algorithm. 

The difference in performance between VSM-AV and VSM-WMTS shows that VSM 
is sensitive to the size of the corpus. Although LRA is able to surpass VSM-AV when the 
WMTS corpus is only about one tenth the size of the AV corpus, it seems likely that LRA 
would perform better with a larger corpus. The WMTS corpus requires one terabyte of 
hard disk space, but progress in hardware will likely make ten or even one hundred 
terabytes affordable in the relatively near future. 

For noun-modifier classification, more labeled data should yield performance im- 
provements. With 600 noun-modifier pairs and 30 classes, the average class has only 
20 examples. We expect that the accuracy would improve substantially with five or ten 
times more examples. Unfortunately, it is time consuming and expensive to acquire 
hand-labeled data. 

Another issue with noun-modifier classification is the choice of classification scheme
for the semantic relations. The 30 classes of Nastase and Szpakowicz (2003) might not be
the best scheme. Other researchers have proposed different schemes (Vanderwende, 1994;
Barker and Szpakowicz, 1998; Rosario and Hearst, 2001; Rosario, Hearst, and Fillmore, 2002).
It seems likely that some schemes are easier for machine learning than others. For some 
applications, 30 classes may not be necessary; the 5 class scheme may be sufficient. 

LRA, like VSM, is a corpus-based approach to measuring relational similarity. Past
work suggests that a hybrid approach, combining multiple modules, some corpus-based,
some lexicon-based, will surpass any purebred approach (Turney et al., 2003).
In future work, it would be natural to combine the corpus-based approach of LRA with
the lexicon-based approach of Veale (2004), perhaps using the combination method of
Turney et al. (2003).



The Singular Value Decomposition is only one of many methods for handling sparse,
noisy data. We have also experimented with Nonnegative Matrix Factorization (NMF)
(Lee and Seung, 1999), Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999),
Kernel Principal Components Analysis (KPCA) (Schölkopf, Smola, and Müller, 1997),
and Iterative Scaling (IS) (Ando, 2000). We had some interesting results with small
matrices (around 2,000 rows by 1,000 columns), but none of these methods seemed
substantially better than SVD and none of them scaled up to the matrix sizes we are using
here (e.g., 17,232 rows and 8,000 columns; see Section 6.1).

In step 4 of LRA, we simply select the top num_patterns most frequent patterns 
and discard the remaining patterns. Perhaps a more sophisticated selection algorithm 
would improve the performance of LRA. We have tried a variety of ways of selecting 
patterns, but it seems that the method of selection has little impact on performance. We 
hypothesize that the distributed vector representation is not sensitive to the selection 
method, but it is possible that future work will find a method that yields significant 
improvement in performance. 
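
The selection rule in step 4 amounts to a frequency cut-off, which can be sketched as below. The function name and the sample patterns are hypothetical, and the sketch simply counts the occurrences it is given; how pattern frequencies are gathered from the corpus is outside its scope.

```python
from collections import Counter
from typing import Iterable, List


def select_top_patterns(pattern_occurrences: Iterable[str], num_patterns: int) -> List[str]:
    """Keep the num_patterns most frequent patterns and discard the rest."""
    counts = Counter(pattern_occurrences)
    return [pattern for pattern, _ in counts.most_common(num_patterns)]


# Hypothetical pattern occurrences; real patterns come from the corpus.
occurrences = ["* of the", "in the *", "* of the", "with *", "* of the", "in the *"]
top = select_top_patterns(occurrences, num_patterns=2)
```

A more sophisticated selection method would replace only the ranking criterion inside this function, leaving the rest of the pipeline unchanged.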



9. Conclusion 



This paper has introduced a new method for calculating relational similarity, Latent
Relational Analysis. The experiments demonstrate that LRA performs better than the
VSM approach, when evaluated with SAT word analogy questions and with the task of 
classifying noun-modifier expressions. The VSM approach represents the relation be- 
tween a pair of words with a vector, in which the elements are based on the frequencies 
of 64 hand-built patterns in a large corpus. LRA extends this approach in three ways: 
(1) the patterns are generated dynamically from the corpus, (2) SVD is used to smooth
the data, and (3) a thesaurus is used to explore variations of the word pairs. With the
WMTS corpus (about 5 × 10^10 English words), LRA achieves an F of 56.5%, whereas the
F of VSM is 40.3%.

We have presented several examples of the many potential applications for
measures of relational similarity. Just as attributional similarity measures have proven to
have many practical uses, we expect that relational similarity measures will soon
become widely used. Gentner et al. (2001) argue that relational similarity is essential to
understanding novel metaphors (as opposed to conventional metaphors). Many
researchers have argued that metaphor is the heart of human thinking (Lakoff and Johnson, 1980;
Hofstadter and the Fluid Analogies Research Group, 1995; Gentner et al., 2001; French, 2002).
We believe that relational similarity plays a fundamental role in the mind and therefore
relational similarity measures could be crucial for artificial intelligence.

In future work, we plan to investigate some potential applications for LRA. It is pos- 
sible that the error rate of LRA is still too high for practical applications, but the fact that 
LRA matches average human performance on SAT analogy questions is encouraging. 